Preface


This book is all about the different statistical methods
employed in the field of applied biology. Essentially these
statistical methods and techniques were developed in the
medical and biological sciences. However, these are also
widely used in the social sciences as well as in the field of
engineering and technology.


The role of experimental design in applied biology cannot
be underestimated. Experimental methods are widely used
in research as well as in industrial settings, although
sometimes for very different purposes. The primary goal in
scientific research is usually to show the statistical
significance of an effect that a particular factor exerts on
the dependent variable of interest.


In industrial settings, the primary goal is usually to
extract the maximum amount of unbiased information
regarding the factors affecting a production process from as
few observations as possible. While in the former
application (in science) analysis of variance (ANOVA)
techniques are used to uncover the interactive nature of
reality, as manifested in higher-order interactions of factors,
in industrial settings interactions are often regarded as a
“nuisance”: they are often of no interest; they only
complicate the process of identifying important factors.

Major areas of study in this book are: Elementary 
Concepts in Statistics; Basic Statistics; Nonparametric 
Statistics; ANOVA/MANOVA; Process Analysis; Survival/ 
Failure Time Analysis; Quality Control Charts; 
Representative Visualization Techniques; and Experimental 
Design.

Elementary Concepts in Statistics


In this introduction, we will briefly discuss those elementary 
statistical concepts that provide the necessary foundations 
for more specialized expertise in any area of statistical data 
analysis. The selected topics illustrate the basic assumptions 
of most statistical methods and/or have been demonstrated 
in research to be necessary components of one’s general 
understanding of the “quantitative nature” of reality 
(Nisbett, et al., 1987). Because of space limitations, we will 
focus mostly on the functional aspects of the concepts 
discussed and the presentation will be very short. Further 
information on each of those concepts can be found in the 
Introductory Overview and Examples sections of this manual 
and in statistical textbooks. Recommended introductory 
textbooks are: Kachigan (1986), and Runyon and Haber 
(1976); for a more advanced discussion of elementary theory 
and assumptions of statistics, see the classic books by Hays 
(1988), and Kendall and Stuart (1979). 

What are Variables? 

Variables are things that we measure, control, or 
manipulate in research. They differ in many respects, most 
notably in the role they are given in our research and in the 
type of measures that can be applied to them.

Correlational vs. experimental research: Most
empirical research belongs clearly to one of those two general 
categories. In correlational research we do not (or at least 
try not to) influence any variables but only measure them 
and look for relations (correlations) between some set of 
variables, such as blood pressure and cholesterol level. In 
experimental research, we manipulate some variables and 
then measure the effects of this manipulation on other 
variables; for example, a researcher might artificially 
increase blood pressure and then record cholesterol level. 
Data analysis in experimental research also comes down to 
calculating “correlations” between variables, specifically, 
those manipulated and those affected by the manipulation. 
However, experimental data may potentially provide 
qualitatively better information: Only experimental data can 
conclusively demonstrate causal relations between variables. 
For example, if we found that whenever we change variable 
A then variable B changes, then we can conclude that “A 
influences B.” Data from correlational research can only be 
“interpreted” in causal terms based on some theories that 
we have, but correlational data cannot conclusively prove
causality.

Dependent vs. independent variables: Independent 
variables are those that are manipulated whereas dependent 
variables are only measured or registered. This distinction 
appears terminologically confusing to many because, as some 
students say, “all variables depend on something.” However, 
once you get used to this distinction, it becomes 
indispensable. The terms dependent and independent 
variable apply mostly to experimental research where some 
variables are manipulated, and in this sense they are 
“independent” from the initial reaction patterns, features, 
intentions, etc. of the subjects. Some other variables are 
expected to be “dependent” on the manipulation or 
experimental conditions. That is to say, they depend on 
“what the subject will do” in response. Somewhat contrary
to the nature of this distinction, these terms are also used
in studies where we do not literally manipulate independent 
variables, but only assign subjects to “experimental groups” 
based on some pre-existing properties of the subjects. For 
example, if in an experiment, males are compared with 
females regarding their white cell count (WCC), Gender 
could be called the independent variable and WCC the 
dependent variable. 

Measurement scales: Variables differ in “how well” 
they can be measured, i.e., in how much measurable 
information their measurement scale can provide. There is 
obviously some measurement error involved in every 
measurement, which determines the “amount of information” 
that we can obtain. Another factor that determines the 
amount of information that can be provided by a variable is 
its “type of measurement scale.” Specifically, variables are
classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.

Nominal variables allow for only qualitative 
classification. That is, they can be measured only in terms 
of whether the individual items belong to some distinctively 
different categories, but we cannot quantify or even rank 
order those categories. For example, all we can say is that 
two individuals are different in terms of variable A (e.g., 
they are of different race), but we cannot say which one 
“has more” of the quality represented by the variable. 
Typical examples of nominal variables are gender, race, 
colour, city, etc. 

Ordinal variables allow us to rank order the items we 
measure in terms of which has less and which has more of 
the quality represented by the variable, but still they do not 
allow us to say “how much more.” A typical example of an
ordinal variable is the socioeconomic status of families. For
example, we know that upper-middle is higher than middle
but we cannot say that it is, for example, 18% higher. Also,
this very distinction between nominal, ordinal, and interval
scales itself represents a good example of an ordinal variable.
For example, we can say that nominal measurement provides 
less information than ordinal measurement, but we cannot 
say “how much less” or how this difference compares to the 
difference between ordinal and interval scales. 

Interval variables allow us not only to rank order the 
items that are measured, but also to quantify and compare 
the sizes of differences between them. For example, 
temperature, as measured in degrees Fahrenheit or Celsius, 
constitutes an interval scale. We can say that a temperature 
of 40 degrees is higher than a temperature of 30 degrees, 
and that an increase from 20 to 40 degrees is twice as much 
as an increase from 30 to 40 degrees. 

Ratio variables are very similar to interval variables; in
addition to all the properties of interval variables, they
feature an identifiable absolute zero point, thus they allow
for statements such as x is two times more than y. Typical
examples of ratio scales are measures of time or space. For
example, as the Kelvin temperature scale is a ratio scale, 
not only can we say that a temperature of 200 degrees is 
higher than one of 100 degrees, we can correctly state that 
it is twice as high. Interval scales do not have the ratio 
property. Most statistical data analysis procedures do not 
distinguish between the interval and ratio properties of the 
measurement scales. 
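
The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch in Python; the helper names are hypothetical and chosen only for this illustration. It shows that ratios of differences survive a change of interval scale (Celsius to Fahrenheit), whereas a ratio of raw Celsius readings does not agree with the ratio of the same temperatures on the Kelvin scale.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit (both are interval scales)."""
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c):
    """Convert a Celsius reading to Kelvin (a ratio scale with an absolute zero)."""
    return c + 273.15


def difference_ratio(a, b, c, d):
    """Ratio of the increase from a to b to the increase from c to d."""
    return (b - a) / (d - c)


# Increase from 20 to 40 degrees compared with the increase from 30 to 40 degrees.
points_c = (20.0, 40.0, 30.0, 40.0)
points_f = tuple(celsius_to_fahrenheit(t) for t in points_c)
ratio_of_differences_c = difference_ratio(*points_c)
ratio_of_differences_f = difference_ratio(*points_f)

# "Twice as high" is not meaningful for raw Celsius readings:
naive_celsius_ratio = 40.0 / 20.0
kelvin_ratio = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)

print("ratio of differences (Celsius):   ", ratio_of_differences_c)
print("ratio of differences (Fahrenheit):", ratio_of_differences_f)
print("naive ratio of Celsius readings:  ", naive_celsius_ratio)
print("ratio of the same readings in K:  ", round(kelvin_ratio, 4))
```

The ratio of differences is the same on both interval scales, while the naive Celsius ratio of 2 corresponds to a Kelvin ratio only slightly above 1.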

Relations between variables: Regardless of their type, 
two or more variables are related if in a sample of 
observations, the values of those variables are distributed 
in a consistent manner. In other words, variables are related 
if their values systematically correspond to each other for 
these observations. For example, Gender and WCC would 
be considered to be related if most males had high WCC 
and most females low WCC, or vice versa; Height is related 
to Weight because typically tall individuals are heavier than 
short ones; IQ is related to the Number of Errors in a test, 
if people with higher IQs make fewer errors.

Why relations between variables are important:
Generally speaking, the ultimate goal of every research or 
scientific analysis is finding relations between variables. 
The philosophy of science teaches us that there is no other 
way of representing “meaning” except in terms of relations 
between some quantities or qualities; either way involves 
relations between variables. Thus, the advancement of 
science must always involve finding new relations between 
variables. Correlational research involves measuring such 
relations in the most straightforward manner. However, 
experimental research is not any different in this respect. 
For example, the above mentioned experiment comparing 
WCC in males and females can be described as looking for 
a correlation between two variables: Gender and WCC. 
Statistics does nothing else but help us evaluate relations 
between variables. Actually, all of the hundreds of 
procedures that are described in this manual can be 
interpreted in terms of evaluating various kinds of
inter-variable relations.

Two basic features of every relation between 
variables: The two most elementary formal properties of 
every relation between variables are the relation’s (a) 
magnitude (or “size”) and (b) its reliability (or “truthfulness”). 

Magnitude (or “size”): The magnitude is much easier to
understand and measure than reliability. For example, if 
every male in our sample was found to have a higher WCC 
than any female in the sample, we could say that the 
magnitude of the relation between the two variables (Gender 
and WCC) is very high in our sample. In other words, we 
could predict one based on the other (at least among the
members of our sample). 

Reliability (or “truthfulness”): The reliability of a 
relation is a much less intuitive concept, but still extremely 
important. It pertains to the “representativeness” of the 
result found in our specific sample for the entire population.
In other words, it says how probable it is that a similar
relation would be found if the experiment was replicated
with other samples drawn from the same population.
Remember that we are almost never “ultimately” interested
only in what is going on in our sample; we are interested in 
the sample only to the extent it can provide information 
about the population. If our study meets some specific 
criteria (to be mentioned later), then the reliability of a 
relation between variables observed in our sample can be 
quantitatively estimated and represented using a standard 
measure (technically called p-value or statistical significance 
level, see the next paragraph). 

What is “statistical significance” (p-value): The
statistical significance of a result is the probability of
observing a relationship (e.g., between variables) or a difference
(e.g., between means) at least as strong as the one in our sample
by pure chance (“luck of the draw”), if in the population from which
the sample was drawn no such relationship or difference
exists. Using less technical terms, one could say that the
statistical significance of a result tells us something about 
the degree to which the result is “true” (in the sense of 
being “representative of the population”). More technically, 
the value of the p-value represents a decreasing index of 
the reliability of a result (see Brownlee, 1960). The higher 
the p-value, the less we can believe that the observed relation 
between variables in the sample is a reliable indicator of 
the relation between the respective variables in the 
population. Specifically, the p-value represents the 
probability of error that is involved in accepting our observed 
result as valid, that is, as “representative of the population.” 
For example, a p-value of .05 (i.e., 1/20) indicates that there 
is a 5% probability that the relation between the variables 
found in our sample is a “fluke.” In other words, assuming 
that in the population there was no relation between those 
variables whatsoever, and we were repeating experiments 
like ours one after another, we could expect that 
approximately in every 20 replications of the experiment
there would be one in which the relation between the
variables in question would be equal to or stronger than in
ours. (Note that this is not the same as saying that, given
that there IS a relationship between the variables, we can
expect to replicate the results 5% of the time or 95% of the
time; when there is a relationship between the variables in
the population, the probability of replicating the study and
finding that relationship is related to the statistical power
of the design.) In many areas of research, the p-value of .05
is customarily treated as a “border-line acceptable” error
level.

How to determine that a result is “really” 
significant: There is no way to avoid arbitrariness in the 
final decision as to what level of significance will be treated 
as really “significant.” That is, the selection of some level of 
significance, up to which the results will be rejected as 
invalid, is arbitrary. In practice, the final decision usually 
depends on whether the outcome was predicted a priori or 
only found post hoc in the course of many analyses and 
comparisons performed on the data set, on the total amount 
of consistent supportive evidence in the entire data set, and 
on “traditions” existing in the particular area of research. 
Typically, in many sciences, results that yield p < .05 are 
considered borderline statistically significant, but remember
that this level of significance still involves a pretty high 
probability of error (5%). Results that are significant at the 
p < .01 level are commonly considered statistically 
significant, and p < .005 or p < .001 levels are often called 
“highly” significant. But remember that those classifications 
represent nothing else but arbitrary conventions that are 
only informally based on general research experience. 

Statistical significance and the number of analyses 
performed: Needless to say, the more analyses you perform 
on a data set, the more results will meet “by chance” the 
conventional significance level. For example, if you calculate 
correlations between ten variables (i.e., 45 different
correlation coefficients), then you should expect to find by
chance that about two (i.e., one in every 20) correlation
coefficients are significant at the p < .05 level, even if the
values of the variables were totally random and those
variables do not correlate in the population. Some statistical
methods that involve many comparisons, and thus a good
chance for such errors, include some “correction” or
adjustment for the total number of comparisons. However,
many statistical methods (especially simple exploratory data
analyses) do not offer any straightforward remedies to this
problem. Therefore, it is up to the researcher to carefully
evaluate the reliability of unexpected findings. Many 
examples in this manual offer specific advice on how to do 
this; relevant information can also be found in most research 
methods textbooks. 
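
The arithmetic behind the ten-variable example can be sketched in Python as follows. The function names are hypothetical. Two items go beyond the text and are labeled as assumptions: the chance of at least one false positive is computed as if the 45 tests were independent (pairwise correlations among the same variables are not strictly independent), and the Bonferroni adjustment is shown only as one commonly used example of a “correction”, not as the method any particular procedure applies.

```python
from math import comb


def expected_false_positives(n_variables, alpha=0.05):
    """Number of pairwise correlations and how many are expected to reach
    significance by chance alone when no relation exists in the population."""
    n_tests = comb(n_variables, 2)
    return n_tests, n_tests * alpha


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level under a Bonferroni adjustment
    (an illustrative correction, chosen here as an assumption)."""
    return alpha / n_tests


n_tests, expected = expected_false_positives(10)
print("number of correlations:", n_tests)
print("expected chance findings at p < .05:", expected)
print("approx. chance of at least one:", round(familywise_error_rate(n_tests), 3))
print("Bonferroni per-test level:", round(bonferroni_threshold(n_tests), 5))
```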


Strength vs. reliability of a relation between 
variables: We said before that strength and reliability are
two different features of relationships between variables. 
However, they are not totally independent. In general, in a 
sample of a particular size, the larger the magnitude of the 
relation between variables, the more reliable the relation 
(see the next paragraph). 

Why stronger relations between variables are 
more significant: Assuming that there is no relation 
between the respective variables in the population, the most 
likely outcome would be also finding no relation between 
those variables in the research sample. Thus, the stronger 
the relation found in the sample, the less likely it is that 
there is no corresponding relation in the population. As you 
see, the magnitude and significance of a relation appear to 
be closely related, and we could calculate the significance 
from the magnitude and vice-versa; however, this is true 
only if the sample size is kept constant, because the relation 
of a given strength could be either highly significant or not 
significant at all, depending on the sample size.

Why significance of a relation between variables
depends on the size of the sample: If there are very few 
observations, then there are also respectively few possible 
combinations of the values of the variables, and thus the 
probability of obtaining by chance a combination of those
values indicative of a strong relation is relatively high.
Consider the following illustration. If we are interested in 
two variables (Gender: male/female and WCC: high/low) and 
there are only four subjects in our sample (two males and 
two females), then the probability that we will find, purely 
by chance, a 100% relation between the two variables can 
be as high as one-eighth. Specifically, there is a one-in-eight
chance that both males will have a high WCC and
both females a low WCC, or vice versa. 
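
The one-in-eight figure can be verified by listing every possible outcome. The sketch below is in Python and assumes, as the illustration implicitly does, that each subject is equally likely to have a high or a low WCC, independently of gender. The function names are hypothetical.

```python
from itertools import product


def perfect_relation_probability(n_males, n_females):
    """Enumerate all high/low outcomes and count those in which every male
    falls in one WCC category and every female in the other."""
    n_subjects = n_males + n_females
    outcomes = list(product(("high", "low"), repeat=n_subjects))
    perfect = 0
    for outcome in outcomes:
        males = outcome[:n_males]
        females = outcome[n_males:]
        males_agree = len(set(males)) == 1
        females_agree = len(set(females)) == 1
        if males_agree and females_agree and males[0] != females[0]:
            perfect += 1
    return perfect / len(outcomes)


def perfect_relation_closed_form(n_subjects):
    """Same probability without enumeration: 2 favourable outcomes out of 2**n."""
    return 2.0 / 2.0 ** n_subjects


print("four subjects (enumerated): ", perfect_relation_probability(2, 2))
print("four subjects (closed form):", perfect_relation_closed_form(4))
print("100 subjects (closed form): ", perfect_relation_closed_form(100))
```

Enumeration is only practical for very small samples, which is why the closed form is used for the larger sample size.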

Now consider the probability of obtaining such a perfect 
match by chance if our sample consisted of 100 subjects; 
the probability of obtaining such an outcome by chance would 
be practically zero. Let’s look at a more general example. 
Imagine a theoretical population in which the average value 
of WCC in males and females is exactly the same. Needless 
to say, if we start replicating a simple experiment by drawing 
pairs of samples (of males and females) of a particular size 
from this population and calculating the difference between 
the average WCC in each pair of samples, most of the 
experiments will yield results close to 0. However, from 
time to time, a pair of samples will be drawn where the 
difference between males and females will be quite different 
from 0. How often will it happen? The smaller the sample
size in each experiment, the more likely it is that we will
obtain such erroneous results, which in this case would be
results indicative of the existence of a relation between
gender and WCC obtained from a population in which such
a relation does not exist.
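
The replication experiment just described can be simulated. The following Python sketch rests on assumptions that are not in the text: WCC is taken to be normally distributed with a mean of 100 and a standard deviation of 15 in both genders, and a difference between sample means of more than 5 units is counted as “quite different from 0”. All of these numbers, and the function name, are hypothetical.

```python
import random
import statistics


def share_of_large_differences(sample_size, n_experiments=1000,
                               threshold=5.0, seed=0):
    """Fraction of simulated experiments in which the male and female sample
    means differ by more than `threshold`, although the population means
    are identical (assumed values: mean 100, standard deviation 15)."""
    rng = random.Random(seed)
    large = 0
    for _ in range(n_experiments):
        males = [rng.gauss(100.0, 15.0) for _ in range(sample_size)]
        females = [rng.gauss(100.0, 15.0) for _ in range(sample_size)]
        difference = statistics.fmean(males) - statistics.fmean(females)
        if abs(difference) > threshold:
            large += 1
    return large / n_experiments


for n in (5, 25, 100):
    print("sample size", n, "-> share of large differences:",
          share_of_large_differences(n))
```

Under these assumptions the share of misleading results shrinks as the sample size per experiment grows.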

Example: “Baby boys to baby girls ratio.” Consider the 
following example from research on statistical reasoning 
(Nisbett, et al., 1987). There are two hospitals: in the first
one, 120 babies are born every day, in the other, only 12.
On average, the ratio of baby boys to baby girls born every 
day in each hospital is 50/50. However, one day, in one of 
those hospitals twice as many baby girls were born as baby 
boys. In which hospital was it more likely to happen? The 
answer is obvious for a statistician, but as research shows, 
not so obvious for a layperson: It is much more likely to 
happen in the small hospital. The reason for this is that 
technically speaking, the probability of a random deviation 
of a particular size (from the population mean), decreases 
with the increase in the sample size. 
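
The hospital example can be worked through with the binomial distribution. The sketch below, in Python, assumes that each birth is independently a girl with probability 0.5 and interprets “twice as many baby girls as baby boys” as girls making up at least two thirds of that day's births; both are assumptions made for this illustration, and the function name is hypothetical.

```python
from math import comb


def prob_at_least(n_births, k_girls, p_girl=0.5):
    """Binomial probability of k_girls or more girls among n_births births."""
    return sum(
        comb(n_births, i) * p_girl ** i * (1.0 - p_girl) ** (n_births - i)
        for i in range(k_girls, n_births + 1)
    )


# At least two thirds girls: 8 of 12 in the small hospital, 80 of 120 in the large.
p_small_hospital = prob_at_least(12, 8)
p_large_hospital = prob_at_least(120, 80)

print("small hospital (12 births): ", p_small_hospital)
print("large hospital (120 births):", p_large_hospital)
```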

Why small relations can be proven significant only 
in large samples: The examples in the previous paragraphs 
indicate that if a relationship between variables in question 
is “objectively” (i.e., in the population) small, then there is 
no way to identify such a relation in a study unless the 
research sample is correspondingly large. Even if our sample 
is in fact “perfectly representative” the effect will not be 
statistically significant if the sample is small. Analogously, 
if a relation in question is “objectively” very large (i.e., in 
the population), then it can be found to be highly significant 
even in a study based on a very small sample. Consider the 
following additional illustration. If a coin is slightly 
asymmetrical, and when tossed is somewhat more likely to 
produce heads than tails (e.g., 60% vs. 40%), then ten tosses 
would not be sufficient to convince anyone that the coin is 
asymmetrical, even if the outcome obtained (six heads and 
four tails) was perfectly representative of the bias of the 
coin. However, is it so that 10 tosses is not enough to prove 
anything? No, if the effect in question were large enough, 
then ten tosses could be quite enough. For instance, imagine 
now that the coin is so asymmetrical that no matter how 
you toss it, the outcome will be heads. If you tossed such a 
coin ten times and each toss produced heads, most people 
would consider it sufficient evidence that something is 
“wrong” with the coin. In other words, it would be considered
convincing evidence that in the theoretical population of an
infinite number of tosses of this coin there would be more 
heads than tails. Thus, if a relation is large, then it can be 
found to be significant even in a small sample. 
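
The two coin outcomes can be compared by asking how often a perfectly fair coin would produce a result at least as lopsided. This is a minimal Python sketch with a hypothetical function name; it treats tosses as independent with a 0.5 chance of heads under the “fair coin” assumption.

```python
from math import comb


def prob_heads_at_least(n_tosses, k_heads, p_heads=0.5):
    """Probability of k_heads or more heads in n_tosses independent tosses."""
    return sum(
        comb(n_tosses, i) * p_heads ** i * (1.0 - p_heads) ** (n_tosses - i)
        for i in range(k_heads, n_tosses + 1)
    )


p_six_or_more = prob_heads_at_least(10, 6)   # six heads, four tails, or more extreme
p_all_ten = prob_heads_at_least(10, 10)      # ten heads in ten tosses

print("fair coin, 6 or more heads in 10:", p_six_or_more)
print("fair coin, 10 heads in 10:       ", p_all_ten)
```

A fair coin yields six or more heads in more than a third of such experiments, so that outcome is unconvincing, whereas ten heads in a row occurs in roughly one experiment in a thousand.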

Can “no relation” be a significant result? The 
smaller the relation between variables, the larger the sample 
size that is necessary to prove it significant. For example, 
imagine how many tosses would be necessary to prove that 
a coin is asymmetrical if its bias were only .000001%! Thus, 
the necessary minimum sample size increases as the 
magnitude of the effect to be demonstrated decreases. When 
the magnitude of the effect approaches 0, the necessary 
sample size to conclusively prove it approaches infinity. That 
is to say, if there is almost no relation between two variables, 
then the sample size must be almost equal to the population 
size, which is assumed to be infinitely large. Statistical 
significance represents the probability that a similar outcome 
would be obtained if we tested the entire population. Thus, 
everything that would be found after testing the entire 
population would be, by definition, significant at the highest 
possible level, and this also includes all “no relation” results. 

How to measure the magnitude (strength) of 
relations between variables: There are very many 
measures of the magnitude of relationships between 
variables which have been developed by statisticians; the 
choice of a specific measure in given circumstances depends 
on the number of variables involved, measurement scales 
used, nature of the relations, etc. Almost all of them, 
however, follow one general principle: they attempt to 
somehow evaluate the observed relation by comparing it to 
the “maximum imaginable relation” between those specific 
variables. Technically speaking, a common way to perform 
such evaluations is to look at how differentiated are the 
values of the variables, and then calculate what part of this 
“overall available differentiation” is accounted for by
instances when that differentiation is “common” in the two
(or more) variables in question. Speaking less technically,
we compare “what is common in those variables” to “what
potentially could have been common if the variables were
perfectly related.” Let us consider a simple illustration. Let
us say that in our sample, the average index of WCC is 100
in males and 102 in females. Thus, we could say that on 
average, the deviation of each individual score from the 
grand mean (101) contains a component due to the gender 
of the subject: the size of this component is 1. That value, 
in a sense, represents some measure of relation between 
Gender and WCC. However, this value is a very poor 
measure, because it does not tell us how relatively large 
this component is, given the “overall differentiation” of WCC 
scores. Consider two extreme possibilities: 

If all WCC scores of males were equal exactly to 100, 
and those of females equal to 102, then all deviations from 
the grand mean in our sample would be entirely accounted 
for by gender. We would say that in our sample, gender is 
perfectly correlated with WCC, that is, 100% of the observed 
differences between subjects regarding their WCC is 
accounted for by their gender. 

If WCC scores were in the range of 0-1000, the same 
difference (of 2) between the average WCC of males and 
females found in the study would account for such a small 
part of the overall differentiation of scores that most likely 
it would be considered negligible. For example, one more 
subject taken into account could change, or even reverse 
the direction of the difference. Therefore, every good measure 
of relations between variables must take into account the 
overall differentiation of individual scores in the sample 
and evaluate the relation in terms of (relatively) how much 
of this differentiation is accounted for by the relation in 
question. 
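
The two extreme possibilities can be put into numbers by dividing the variation accounted for by gender by the overall variation of the scores. The Python sketch below uses made-up scores of three males and three females per scenario (hypothetical data chosen so that the group means are 100 and 102 in both cases); the function name is also hypothetical.

```python
def explained_variation_ratio(groups):
    """Share of the total squared deviation from the grand mean that is
    accounted for by group membership (between-group / total variation)."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    total = sum((x - grand_mean) ** 2 for x in all_scores)
    between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return between / total


# Scenario 1: every male scores exactly 100 and every female exactly 102.
no_spread = {"male": [100, 100, 100], "female": [102, 102, 102]}
# Scenario 2: same group means, but individual scores vary widely.
wide_spread = {"male": [0, 100, 200], "female": [2, 102, 202]}

print("no spread within groups:  ", explained_variation_ratio(no_spread))
print("wide spread within groups:", explained_variation_ratio(wide_spread))
```

The same difference of 2 between the group means accounts for all of the variation in the first scenario and for a negligible share in the second.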


Common “general format” of most statistical tests:
Because the ultimate goal of most statistical tests is to
evaluate relations between variables, most statistical tests
follow the general format that was explained in the previous 
paragraph. Technically speaking, they represent a ratio of 
some measure of the differentiation common in the variables 
in question to the overall differentiation of those variables. 
For example, they represent a ratio of the part of the overall
differentiation of the WCC scores that can be accounted for
by gender to the overall differentiation of the WCC scores.
This ratio is usually called a ratio of explained variation to
total variation. In statistics, the term explained variation
does not necessarily imply that we “conceptually understand”
it. It is used only to denote the common variation in the
variables in question, that is, the part of variation in one 
variable that is “explained” by the specific values of the 
other variable, and vice versa. 

How the “level of statistical significance” is 
calculated: Let us assume that we have already calculated 
a measure of a relation between two variables (as explained 
above). The next question is “how significant is this relation?” 
For example, is 40% of the explained variance between the 
two variables enough to consider the relation significant? 
The answer is “it depends.” Specifically, the significance 
depends mostly on the sample size. As explained before, in 
very large samples, even very small relations between 
variables will be significant, whereas in very small samples 
even very large relations cannot be considered reliable 
(significant). Thus, in order to determine the level of 
statistical significance, we need a function that represents 
the relationship between “magnitude” and “significance” of 
relations between two variables, depending on the sample 
size. The function we need would tell us exactly “how likely
it is to obtain a relation of a given magnitude (or larger)
from a sample of a given size, assuming that there is no
such relation between those variables in the population.” In
other words, that function would give us the significance
(p) level, and it would tell us the probability of error involved
in rejecting the idea that the relation in question does not
exist in the population. This “alternative” hypothesis (that 
there is no relation in the population) is usually called the 
null hypothesis. It would be ideal if the probability function 
was linear, and for example, only had different slopes for 
different sample sizes. Unfortunately, the function is more 
complex, and is not always exactly the same; however, in 
most cases we know its shape and can use it to determine 
the significance levels for our findings in samples of a 
particular size. Most of those functions are related to a 
general type of function, which is called normal. 
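
One concrete instance of such a function can be sketched for a correlation coefficient. The Python code below uses the Fisher z-transformation with a normal approximation; this particular technique is an assumption introduced here for illustration and is not named in the text, the result is only approximate (especially in small samples), and the function name is hypothetical.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-sided p-value for a sample correlation r from n
    observations, assuming no correlation in the population.
    Uses the Fisher z-transformation and a normal approximation."""
    if n <= 3:
        raise ValueError("the approximation needs more than 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = atanh(r) * sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same magnitude of relation at different sample sizes.
for n in (10, 30, 100, 1000):
    print("r = 0.3, n =", n, "-> approximate p =", round(approx_p_value(0.3, n), 5))
```

For a fixed magnitude of relation, the approximate p-value falls as the sample size grows, which is the dependence on sample size described above.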

Why the “Normal distribution” is important: The 
“Normal distribution” is important because in most cases, it 
well approximates the function that was introduced in the 
previous paragraph. The distribution of many test statistics 
is normal or follows some form that can be derived from the 
normal distribution. In this sense, philosophically speaking, 
the Normal distribution represents one of the empirically 
verified elementary “truths about the general nature of 
reality,” and its status can be compared to the one of 
fundamental laws of natural sciences. The exact shape of 
the normal distribution (the characteristic “bell curve”) is 
defined by a function, which has only two parameters: mean 
and standard deviation. 

A characteristic property of the Normal distribution is 
that 68% of all of its observations fall within a range of ±1 
standard deviation from the mean, and a range of ±2 
standard deviations includes 95% of the scores. In other 
words, in a Normal distribution, observations that have a 
standardized value of less than -2 or more than +2 have a 
relative frequency of 5% or less. (Standardized value means 
that a value is expressed in terms of its difference from the 
mean, divided by the standard deviation.) If you have access 
to STATISTICA, you can explore the exact values of 
probability associated with different values in the normal 
distribution using the interactive Probability Calculator tool;
for example, if you enter the Z value (i.e., standardized
value) of 4, the associated probability computed by
STATISTICA will be less than .0001, because in the normal
distribution almost all observations (i.e., more than 99.99%)
fall within the range of ±4 standard deviations.
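
For readers without access to STATISTICA, the same probabilities can be computed with Python's standard library. This is a minimal sketch; the helper names are hypothetical, and the two-tailed probability is used for the Z value of 4.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Share of observations within plus or minus k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def two_tailed_probability(z):
    """Probability of a standardized value at least as far from 0 as z."""
    return 2.0 * (1.0 - standard_normal.cdf(abs(z)))


for k in (1, 2, 4):
    print("within +/-", k, "standard deviations:", round(share_within(k), 6))
print("two-tailed probability for Z = 4:", two_tailed_probability(4))
```

The computed shares show that the 68% and 95% figures quoted above are rounded values.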



Illustration of how the normal distribution is used in 
statistical reasoning (induction). Recall the example 
discussed above, where pairs of samples of males and females 
were drawn from a population in which the average value 
of WCC in males and females was exactly the same. 
Although the most likely outcome of such experiments (one 
pair of samples per experiment) was that the difference 
between the average WCC in males and females in each 
pair is close to zero, from time to time, a pair of samples 
will be drawn where the difference between males and 
females is quite different from 0. How often does it happen? 
If the sample size is large enough, the results of such 
replications are “normally distributed” (this important 
principle is explained and illustrated in the next paragraph), 
and thus knowing the shape of the normal curve, we can 
precisely calculate the probability of obtaining “by chance” 
outcomes representing various levels of deviation from the 
hypothetical population mean of 0. If such a calculated 
probability is so low that it meets the previously accepted 
criterion of statistical significance, then we have only one 
choice: conclude that our result gives a better approximation 
of what is going on in the population than the “null 
hypothesis” (remember that the null hypothesis was 
considered only for “technical reasons” as a benchmark 
against which our empirical result was evaluated). Note 
that this entire reasoning is based on the assumption that 
the shape of the distribution of those “replications” 
(technically, the “sampling distribution”) is normal. This 
assumption is discussed in the next paragraph. 

Are all test statistics normally distributed? Not 
all, but most of them are either based on the normal 
distribution directly or on distributions that are related to, 
and can be derived from, the normal distribution, such as 
t, F, or Chi-square. Typically, 
those tests require that the variables analysed are 
themselves normally distributed in the population, that is, 
they meet the so-called “normality assumption.” Many 
observed variables actually are normally distributed, which 
is another reason why the normal distribution represents a 
“general feature” of empirical reality. The problem may occur 
when one tries to use a normal distribution-based test to 
analyse data from variables that are themselves not normally 
distributed. In such cases we have two general choices. First, 
we can use some alternative “nonparametric” test, but this 
is often inconvenient because such tests are typically less 
powerful and less flexible in terms of types of conclusions 
that they can provide. Alternatively, in many cases we can 
still use the normal distribution-based test if we only make 
sure that the size of our samples is large enough. The latter 
option is based on an extremely important principle, which 
is largely responsible for the popularity of tests that are 
based on the normal function. Namely, as the sample size 
increases, the shape of the sampling distribution (i.e., 
distribution of a statistic from the sample; this term was 
first used by Fisher, 1928a) approaches normal shape, even 
if the distribution of the variable in question is not normal. 

This principle is illustrated in the following illustration 
showing a series of sampling distributions (created with 
gradually increasing sample sizes of: 2, 5, 10, 15, and 30) 
using a variable that is clearly non-normal in the population, 
that is, the distribution of its values is clearly skewed. 

However, as the sample size (of samples used to create 
the sampling distribution of the mean) increases, the shape 
of the sampling distribution becomes normal. Note that for 
n=30, the shape of that distribution is “almost” perfectly 
normal. This principle is called the central limit theorem. 
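
The same principle can be checked by simulation. The sketch below is our own illustration, not the procedure used to produce the original figure: it assumes an exponential population as the "clearly skewed" variable, draws 5000 samples at each sample size, and reports the skewness of the resulting sample means.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by the cubed SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def sampling_distribution_of_mean(n, replications, rng):
    """Means of many samples of size n drawn from a skewed (exponential) population."""
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(replications)]

rng = random.Random(0)  # fixed seed so the run is repeatable
skew_by_n = {}
for n in (2, 5, 10, 15, 30):
    means = sampling_distribution_of_mean(n, 5000, rng)
    skew_by_n[n] = skewness(means)
    print(f"n={n:2d}  skewness of the sample means = {skew_by_n[n]:.2f}")
```

The skewness of the sampling distribution shrinks toward 0 (the value for a normal distribution) as n grows, although for this particular population it is still not exactly 0 at n=30.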


How do we know the consequences of violating 
the normality assumption? Although many of the 
statements made in the preceding paragraphs can be proven 
mathematically, some of them do not have theoretical proofs 
and can be demonstrated only empirically, via so-called 
Monte-Carlo experiments. In these experiments, a computer 
following predesigned specifications generates large numbers 
of samples and the results from such samples are analysed 
using a variety of tests. This way we can empirically evaluate 
the type and magnitude of errors or biases to which we are 
exposed when certain theoretical assumptions of the tests 
we are using are not met by our data. Specifically, Monte-Carlo 
studies were used extensively with normal 
distribution-based tests to determine how sensitive they are 
to violations of the assumption of normal distribution of the 
analysed variables in the population. The general conclusion 
from these studies is that the consequences of such violations 
are less severe than previously thought. Although these 
conclusions should not entirely discourage anyone from being 
concerned about the normality assumption, they have 
increased the overall popularity of the distribution-dependent 
statistical tests in all areas of research. 
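
A small Monte-Carlo experiment of this kind can be sketched as follows. The design is our own assumption for illustration: two groups of 10 observations are drawn from the same population, a pooled-variance t statistic is computed, and the result is compared with the tabulated two-tailed .05 critical value of t for 18 degrees of freedom (2.101). Because the null hypothesis is true by construction, every rejection is a false positive.

```python
import random

N_PER_GROUP = 10
T_CRITICAL = 2.101  # two-tailed .05 critical value of t, df = 18 (tied to N_PER_GROUP = 10)

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pooled_t(a, b):
    """t statistic for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * sample_variance(a) + (nb - 1) * sample_variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled * (1.0 / na + 1.0 / nb)) ** 0.5

def false_positive_rate(draw, replications=4000, seed=1):
    """Share of replications in which a true null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        a = [draw(rng) for _ in range(N_PER_GROUP)]
        b = [draw(rng) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(a, b)) > T_CRITICAL:
            rejections += 1
    return rejections / replications

rate_normal = false_positive_rate(lambda rng: rng.gauss(0.0, 1.0))
rate_skewed = false_positive_rate(lambda rng: rng.expovariate(1.0))
print(f"normal population: {rate_normal:.3f}")
print(f"skewed population: {rate_skewed:.3f}")
```

If the test were badly affected by the violation, the rate for the skewed population would depart strongly from the nominal .05; a single small experiment like this one is, of course, only suggestive.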
“True” Mean and Confidence Interval 

Probably the most often used descriptive statistic is the mean. 
The mean is a particularly informative measure of the 
“central tendency” of the variable if it is reported along with 
its confidence intervals. As mentioned earlier, usually we 
are interested in statistics (such as the mean) from our 
sample only to the extent to which they allow us to infer 
information about the population. The confidence intervals 
for the mean give us a range of values around the mean where 
we expect the “true” (population) mean is located. For example, if the 
mean in your sample is 23, and the lower and upper limits 
of the p=.05 confidence interval are 19 and 27 respectively, 
then you can conclude that there is a 95% probability that 
the population mean is greater than 19 and lower than 27. 
If you set the p-level to a smaller value, then the interval 
would become wider thereby increasing the “certainty” of 
the estimate, and vice versa; as we all know from the weather 
forecast, the more “vague” the prediction (i.e., wider the 
confidence interval), the more likely it will materialize. Note 
that the width of the confidence interval depends on the 
sample size and on the variation of data values. The larger 
the sample size, the more reliable its mean. The larger the 
variation, the less reliable the mean. The calculation of 
confidence intervals is based on the assumption that the 
variable is normally distributed in the population. The 
estimate may not be valid if this assumption is not met 
unless the sample size is large, say n=100 or more. 
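
The dependence of the interval on the p-level, the sample size, and the variation can be made concrete with a short sketch. It uses the large-sample normal approximation (mean ± z × standard error); for small samples a critical value from the t distribution would normally be used instead. The function name and the simulated data are hypothetical.

```python
import random
from statistics import NormalDist, fmean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Large-sample confidence interval for the mean (normal approximation)."""
    n = len(values)
    m = fmean(values)
    standard_error = stdev(values) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return m - z * standard_error, m + z * standard_error

rng = random.Random(42)
sample = [rng.gauss(23.0, 20.0) for _ in range(100)]  # hypothetical data

low95, high95 = mean_confidence_interval(sample, 0.95)
low99, high99 = mean_confidence_interval(sample, 0.99)
print(f"mean = {fmean(sample):.2f}")
print(f"95% interval: {low95:.2f} to {high95:.2f}")
print(f"99% interval: {low99:.2f} to {high99:.2f}")
```

The 99% interval is wider than the 95% interval computed from the same data, which is the trade-off between "certainty" and precision described above.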




Shape of the Distribution 

Normality. An important aspect of the “description” of a 
variable is the shape of its distribution, which tells you the 
frequency of values from different ranges of the variable. 
Typically, a researcher is interested in how well the 
distribution can be approximated by the normal distribution 
(see the illustration below for an example of this distribution). 
Simple descriptive statistics can provide some information 
relevant to this issue. For example, if the skewness (which 
measures the deviation of the distribution from symmetry) 
is clearly different from 0, then that distribution is 
asymmetrical, while normal distributions are perfectly 
symmetrical. If the kurtosis (which measures “peakedness” 
of the distribution) is clearly different from 0, then the 
distribution is either flatter or more peaked than normal; 
the kurtosis of the normal distribution is 0. 
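
Both statistics are simple functions of the central moments of the data. The sketch below uses plain moment-based formulas; statistical packages usually apply additional small-sample corrections, so their output may differ slightly. The kurtosis is computed as "excess" kurtosis (3 is subtracted), which is the convention under which the normal distribution has a kurtosis of 0. The two data sets are hypothetical.

```python
from statistics import fmean

def central_moment(values, k):
    """Average k-th power of the deviations from the mean."""
    m = fmean(values)
    return sum((v - m) ** k for v in values) / len(values)

def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 1, 2, 2, 3, 10]

print("symmetric:   ", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric set has a skewness of 0 and a negative kurtosis (it is flatter than normal), while the set with one large value has a clearly positive skewness.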



More precise information can be obtained by performing 
one of the tests of normality to determine the probability 
that the sample came from a normally distributed population 
of observations (e.g., the so-called Kolmogorov-Smirnov test, 
or the Shapiro-Wilk W test). However, none of these tests 
can entirely substitute for a visual examination of the data 
using a histogram (i.e., a graph that shows the frequency 
distribution of a variable). 

The graph allows you to evaluate the normality of the 
empirical distribution because it also shows the normal curve 
superimposed over the histogram. It also allows you to 
examine various aspects of the distribution qualitatively. 
For example, the distribution could be bimodal (have 2 
peaks). This might suggest that the sample is not 
homogeneous but possibly its elements came from two 
different populations, each more or less normally distributed. 
In such cases, in order to understand the nature of the 
variable in question, you should look for a way to 
quantitatively identify the two sub-samples. 

Correlations 

Purpose (What is Correlation?) Correlation is a measure 
of the relation between two or more variables. The 
measurement scales used should be at least interval scales, 
but other correlation coefficients are available to handle 
other types of data. Correlation coefficients can range from 
-1.00 to +1.00. The value of -1.00 represents a perfect 
negative correlation while a value of +1.00 represents a 
perfect positive correlation. A value of 0.00 represents a 
lack of correlation. 





The most widely-used type of correlation coefficient is 
Pearson r, also called linear or product-moment correlation. 

Simple Linear Correlation (Pearson r): Pearson 
correlation (hereafter called correlation) assumes that the 
two variables are measured on at least interval scales, and 
it determines the extent to which values of the two variables 
are “proportional” to each other. The value of correlation 
(i.e., correlation coefficient) does not depend on the specific 
measurement units used; for example, the correlation 
between height and weight will be identical regardless of 
whether inches and pounds, or centimeters and kilograms 
are used as measurement units. Proportional means linearly 
related; that is, the correlation is high if it can be 
“summarized” by a straight line (sloped upwards or 
downwards). 

This line is called the regression line or least squares 
line, because it is determined such that the sum of the 
squared distances of all the data points from the line is the 
lowest possible. Note that the concept of squared distances 
will have important functional consequences on how the 
value of the correlation coefficient reacts to various specific 
arrangements of data (as we will later see). 
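
The independence of the correlation from measurement units is easy to verify numerically. The following sketch computes Pearson r directly from its definition (the sum of cross-products of deviations divided by the square root of the product of the two sums of squares); the height and weight values are hypothetical.

```python
from statistics import fmean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

height_in = [61, 64, 66, 68, 70, 72, 75]         # inches (hypothetical)
weight_lb = [110, 128, 140, 150, 171, 180, 205]  # pounds (hypothetical)

height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_cm, weight_kg)
print(r_imperial, r_metric)
```

Apart from floating-point rounding, the two coefficients are identical, because rescaling a variable changes its deviations and its standard deviation by the same factor.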

How to Interpret the Values of Correlations: As 
mentioned before, the correlation coefficient (r) represents 
the linear relationship between two variables. If the 
correlation coefficient is squared, then the resulting value 
(r², the coefficient of determination) will represent the 
proportion of common variation in the two variables (i.e., 
the “strength” or “magnitude” of the relationship). In order 
to evaluate the correlation between variables, it is important 
to know this “magnitude” or “strength” as well as the 
significance of the correlation. 

Significance of Correlations: The significance level 
calculated for each correlation is a primary source of 
information about the reliability of the correlation. As 
explained before, the significance of a correlation coefficient 
of a particular magnitude will change depending on the size 
of the sample from which it was computed. The test of 
significance is based on the assumption that the distribution 
of the residual values (i.e., the deviations from the regression 
line) for the dependent variable y follows the normal 
distribution, and that the variability of the residual values 
is the same for all values of the independent variable x. 
However, Monte-Carlo studies suggest that meeting those 
assumptions closely is not absolutely crucial if your sample 
size is not very small and when the departure from normality 
is not very large. It is impossible to formulate precise 
recommendations based on those Monte-Carlo results, but 
many researchers follow a rule of thumb that if your sample 
size is 50 or more then serious biases are unlikely, and if 
your sample size is over 100 then you should not be 
concerned at all with the normality assumptions. There are, 
however, much more common and serious threats to the 
validity of information that a correlation coefficient can 
provide; they are briefly discussed in the following 
paragraphs. 
Outliers: Outliers are atypical (by definition), infrequent 
observations. Because of the way in which the regression 
line is determined (especially the fact that it is based on 
minimizing not the sum of simple distances but the sum of 
squares of distances of data points from the line), outliers 
have a profound influence on the slope of the regression 
line and consequently on the value of the correlation 
coefficient. A single outlier is capable of considerably 
changing the slope of the regression line and, consequently, 
the value of the correlation, as demonstrated in the following 
example. Note that, as shown in that illustration, just one 
outlier can be entirely responsible for a high value of the 
correlation that otherwise (without the outlier) would be 
close to zero. Needless to say, one should never base 
important conclusions on the value of the correlation 
coefficient alone (i.e., examining the respective scatterplot 
is always recommended). 
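
The effect of a single outlier can also be shown with numbers. In the sketch below (hypothetical data, not the data behind the original illustration), eight points are essentially uncorrelated; adding one extreme observation produces a very high coefficient.

```python
from statistics import fmean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 3, 6, 4, 6, 3, 5, 4]   # no systematic relation to x

r_without_outlier = pearson_r(x, y)
r_with_outlier = pearson_r(x + [30], y + [30])  # one extreme point added

print(f"without outlier: r = {r_without_outlier:.2f}")
print(f"with outlier:    r = {r_with_outlier:.2f}")
```

Because the distances from the line are squared, the single distant point dominates both the slope of the line and the value of r.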



Note that if the sample size is relatively small, then 
including or excluding specific data points that are not as 
clearly “outliers” as the one shown in the previous example 
may have a profound influence on the regression line (and 
the correlation coefficient). This is illustrated in the following 
example where we call the points being excluded “outliers;” 
one may argue, however, that they are not outliers but 
rather extreme values. 

Typically, we believe that outliers represent a random 
error that we would like to be able to control. Unfortunately, 
there is no widely accepted method to remove outliers 
automatically (however, see the next paragraph), thus what 
we are left with is to identify any outliers by examining a 
scatterplot of each important correlation. Needless to say, 
outliers may not only artificially increase the value of a 
correlation coefficient, but they can also decrease the value 
of a “legitimate” correlation. 


Quantitative Approach to Outliers: Some researchers 
use quantitative methods to exclude outliers. For example, 
they exclude observations that are outside the range of ±2 
standard deviations (or even ±1.5 sd's) around the group or 
design cell mean. In some areas of research, such “cleaning” 
of the data is absolutely necessary. For example, in cognitive 
psychology research on reaction times, even if almost all 
scores in an experiment are in the range of 300-700 
milliseconds, just a few “distracted reactions” of 10-15 
seconds will completely change the overall picture. 
Unfortunately, defining an outlier is subjective (as it should 
be), and the decisions concerning how to identify them must 
be made on an individual basis (taking into account specific 
experimental paradigms and/or “accepted practice” and 
general research experience in the respective area). It should 
also be noted that in some rare cases, the relative frequency 
of outliers across a number of groups or cells of a design 
can be subjected to analysis and provide interpretable 
results. For example, outliers could be indicative of the 
occurrence of a phenomenon that is qualitatively different 
than the typical pattern observed or expected in the sample, 
thus the relative frequency of outliers could provide evidence 
of a relative frequency of departure from the process or 
phenomenon that is typical for the majority of cases in a 
group. 

Correlations in Non-homogeneous Groups: A lack 
of homogeneity in the sample from which a correlation was 
calculated can be another factor that biases the value of the 
correlation. Imagine a case where a correlation coefficient 
is calculated from data points which came from two different 
experimental groups but this fact is ignored when the 
correlation is calculated. Let us assume that the 
experimental manipulation in one of the groups increased 
the values of both correlated variables and thus the data 
from each group form a distinctive “cloud” in the scatterplot 
(as shown in the graph below). 

In such cases, a high correlation may result that is 
entirely due to the arrangement of the two groups, but 
which does not represent the “true” relation between the 
two variables, which may practically be equal to 0 (as could 
be seen if we looked at each group separately, see the 
following graph). 

If you suspect the influence of such a phenomenon on 
your correlations and know how to identify such “subsets” 
of data, try to run the correlations separately in each subset 
of observations. If you do not know how to identify the 
hypothetical subsets, try to examine the data with some 
exploratory multivariate techniques. 
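
Running the correlations separately in each subset can be sketched as follows. The two groups are hypothetical and were constructed so that the correlation inside each group is exactly 0, while the second group has higher values on both variables.

```python
from statistics import fmean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Group A (e.g., control) and group B (manipulation raised both variables by 10)
group_a_x, group_a_y = [1, 2, 3, 4, 5], [2, 4, 3, 4, 2]
group_b_x, group_b_y = [11, 12, 13, 14, 15], [12, 14, 13, 14, 12]

r_group_a = pearson_r(group_a_x, group_a_y)
r_group_b = pearson_r(group_b_x, group_b_y)
r_pooled = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: r = {r_group_a:.2f}")
print(f"group B: r = {r_group_b:.2f}")
print(f"pooled:  r = {r_pooled:.2f}")
```

The pooled coefficient is high only because the two "clouds" are displaced relative to each other; within either group there is no relation at all.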

Nonlinear Relations between Variables: Another 
potential source of problems with the linear (Pearson r) 
correlation is the shape of the relation. As mentioned before, 
Pearson r measures a relation between two variables only 
to the extent to which it is linear; deviations from linearity 
will increase the total sum of squared distances from the 
regression line even if they represent a “true” and very 
close relationship between two variables. The possibility of 
such non-linear relationships is another reason why 
examining scatterplots is a necessary step in evaluating 
every correlation. For example, the following graph 
demonstrates an extremely strong correlation between the 
two variables, which is not well described by the linear 
function. 

Measuring Nonlinear Relations: What do you do if a 
correlation is strong but clearly nonlinear (as concluded 
from examining scatterplots)? Unfortunately, there is no 
simple answer to this question, because there is no easy-to- 
use equivalent of Pearson r that is capable of handling 
nonlinear relations. If the curve is monotonic (continuously 
decreasing or increasing), you could try to transform one or 
both of the variables to remove the curvilinearity and then 
recalculate the correlation. For example, a typical 
transformation used in such cases is the logarithmic 
function, which will “squeeze” together the values at one 
end of the range. Another option available if the relation is 
monotonic is to try a nonparametric correlation, which is 
sensitive only to the ordinal arrangement of values, thus, 
by definition, it ignores monotonic curvilinearity. However, 
nonparametric correlations are generally less sensitive and 
sometimes this method will not produce any gains. 
Unfortunately, the two most precise methods are not easy 
to use and require a good deal of “experimentation” with 
the data. Therefore you could: 

Try to identify the specific function that best describes 
the curve. After a function has been found, you can test its 
“goodness-of-fit” to your data. 

Alternatively, you could experiment with dividing one 
of the variables into a number of segments (e.g., 4 or 5) of 
an equal width, treat this new variable as a grouping 
variable and run an analysis of variance on the data. 
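
The two simpler options mentioned above for monotonic relations (a logarithmic transformation and a rank-based, nonparametric correlation) can be sketched as follows. The data are hypothetical: y grows exponentially with x, so the relation is perfect but not linear. The rank correlation shown here is Pearson r computed on ranks (that is, Spearman's coefficient), and the ranking helper assumes there are no tied values.

```python
import math
from statistics import fmean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.exp(v) for v in x]   # perfectly related to x, but curvilinear

r_raw = pearson_r(x, y)
r_log = pearson_r(x, [math.log(v) for v in y])
r_rank = pearson_r(ranks(x), ranks(y))

print(f"raw scores:       r = {r_raw:.3f}")
print(f"log-transformed:  r = {r_log:.3f}")
print(f"rank correlation: r = {r_rank:.3f}")
```

The raw Pearson r understates the relation, while both the transformed and the rank-based coefficients recover it fully.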

Exploratory Examination of Correlation Matrices: 
A common first step of many data analyses that involve 
more than a very few variables is to run a correlation matrix 
of all variables and then examine it for expected (and 
unexpected) significant relations. When this is done, you 
need to be aware of the general nature of statistical 
significance; specifically, if you run many tests (in this case, 
many correlations), then significant results will be found 
“surprisingly often” due to pure chance. For example, by 
definition, a coefficient significant at the .05 level will occur 
by chance, on average, once in every 20 coefficients. There is no 
“automatic” way to weed out the “true” correlations. Thus, 
you should treat all results that were not predicted or 
planned with particular caution and look for their 
consistency with other results; ultimately, though, the most 
conclusive (although costly) control for such a randomness 
factor is to replicate the study. This issue is general and it 
pertains to all analyses that involve “multiple comparisons 
and statistical significance.” 

Casewise vs. Pairwise Deletion of Missing Data: 
The default way of deleting missing data while calculating 
a correlation matrix is to exclude all cases that have missing 
data in at least one of the selected variables; that is, by 
casewise deletion of missing data. Only this way will you 
get a “true” correlation matrix, where all correlations are 
obtained from the same set of observations. However, if 
missing data are randomly distributed across cases, you 
could easily end up with no “valid” cases in the data set, 
because each of them will have at least one missing data in 
some variable. The most common solution used in such 
instances is to use so-called pairwise deletion of missing 
data in correlation matrices, where a correlation between 
each pair of variables is calculated from all cases that have 
valid data on those two variables. In many instances there 
is nothing wrong with that method, especially when the 
total percentage of missing data is low, say 10%, and they 
are relatively randomly distributed between cases and 
variables. However, it may sometimes lead to serious 
problems. 

For example, a systematic bias may result from a 
“hidden” systematic distribution of missing data, causing 
different correlation coefficients in the same correlation 
matrix to be based on different subsets of subjects. In 
addition to the possibly biased conclusions that you could 
derive from such “pairwise calculated” correlation matrices, 
real problems may occur when you subject such matrices to 
another analysis that expects a “true” correlation matrix, 
with a certain level of consistency and “transitivity” between 
different coefficients. Thus, if you are using the pairwise 
method of deleting the missing data, be sure to examine the 
distribution of missing data across the cells of the matrix 
for possible systematic “patterns.” 

How to Identify Biases Caused by Pairwise Deletion 
of Missing Data: If the pairwise 
deletion of missing data does not introduce any systematic 
bias to the correlation matrix, then all those pairwise 
descriptive statistics for one variable should be very similar. 
However, if they differ, then there are good reasons to 
suspect a bias. For example, if the mean (or standard 
deviation) of the values of variable A that were taken into 
account in calculating its correlation with variable B is much 
lower than the mean (or standard deviation) of those values 
of variable A that were used in calculating its correlation 
with variable C, then we would have good reason to suspect 
that those two correlations (A-B and A-C) are based on 
different subsets of data, and thus, that there is a bias in 
the correlation matrix caused by a non-random distribution 
of missing data. 

Pairwise Deletion of Missing Data vs. Mean 
Substitution: Another common method to avoid losing 
data due to casewise deletion is the so-called mean 
substitution of missing data (replacing all missing data in a 
variable by the mean of that variable). Mean substitution 
offers some advantages and some disadvantages as compared 
to pairwise deletion. Its main advantage is that it produces 
“internally consistent” sets of results (“true” correlation 
matrices). The main disadvantages are: 

Mean substitution artificially decreases the variation of 
scores, and this decrease in individual variables is 
proportional to the number of missing data (i.e., the more 
missing data, the more “perfectly average scores” will be 
artificially added to the data set). Because it substitutes 
missing data with artificially created “average” data points, 
mean substitution may considerably change the values of 
correlations. 
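
The first disadvantage is easy to demonstrate. In the sketch below, three of ten hypothetical scores are missing (represented as None) and are replaced by the mean of the seven observed scores.

```python
from statistics import fmean, variance

scores = [12, None, 15, 9, None, 20, 11, None, 17, 14]  # None marks missing data

observed = [v for v in scores if v is not None]
observed_mean = fmean(observed)

# Mean substitution: every missing value becomes a "perfectly average" score
filled = [observed_mean if v is None else v for v in scores]

variance_observed = variance(observed)
variance_filled = variance(filled)

print(f"mean (unchanged by substitution): {fmean(filled):.2f}")
print(f"variance of observed scores:      {variance_observed:.2f}")
print(f"variance after mean substitution: {variance_filled:.2f}")
```

The substituted values add nothing to the sum of squared deviations but do increase the number of cases, so the variance shrinks (here by the factor 6/9), and it shrinks more the more data are missing.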

Spurious Correlations: Although you cannot prove 
causal relations based on correlation coefficients, you can 
still identify so-called spurious correlations; that is, 
correlations that are due mostly to the influences of “other” 
variables. For example, there is a correlation between the 
total amount of losses in a fire and the number of firemen 
that were putting out the fire; however, what this correlation 
does not indicate is that if you call fewer firemen then you 
would lower the losses. There is a third variable (the initial 
size of the fire) that influences both the amount of losses 
and the number of firemen. If you “control” for this variable 
(e.g., consider only fires of a fixed size), then the correlation 
will either disappear or perhaps even change its sign. The 
main problem with spurious correlations is that we typically 
do not know what the “hidden” agent is. However, in cases 
when we know where to look, we can use partial correlations 
that control for (partial out) the influence of specified 
variables. 

Are correlation coefficients “additive?” No, they are 
not. For example, an average of correlation coefficients in a 
number of samples does not represent an “average 
correlation” in all those samples. Because the value of the 
correlation coefficient is not a linear function of the 
magnitude of the relation between the variables, correlation 
coefficients cannot simply be averaged. In cases when you 
need to average correlations, they first have to be converted 
into additive measures. For example, before averaging, you 
can square them to obtain coefficients of determination which 
are additive (as explained before in this section), or convert 
them into so-called Fisher z values, which are also additive. 
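
The conversion to Fisher z values can be sketched as follows. Fisher's z is the inverse hyperbolic tangent of r, and the averaged z is converted back to a correlation with the hyperbolic tangent. The example is deliberately simplified: the three coefficients are hypothetical and are given equal weight, which is reasonable only if the samples are of similar size.

```python
import math

def average_correlation(coefficients):
    """Average correlations via Fisher z values (equal weights assumed)."""
    z_values = [math.atanh(r) for r in coefficients]
    return math.tanh(sum(z_values) / len(z_values))

correlations = [0.30, 0.60, 0.90]  # hypothetical coefficients from three samples

naive_average = sum(correlations) / len(correlations)
fisher_average = average_correlation(correlations)

print(f"simple average of r:     {naive_average:.3f}")
print(f"average via Fisher z:    {fisher_average:.3f}")
```

The two results differ (about .60 versus about .68) because equal steps in r do not correspond to equal steps in the strength of the relation, especially near ±1.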

How to Determine Whether Two Correlation 
Coefficients Differ Significantly: A test is available that will 
evaluate the significance of differences between two 
correlation coefficients in two samples. The outcome of this 
test depends not only on the size of the raw difference 
between the two coefficients but also on the size of the 
samples and on the size of the coefficients themselves. 
Consistent with the previously discussed principle, the larger 
the sample size, the smaller the effect that can be proven 
significant in that sample. In general, due to the fact that 
the reliability of the correlation coefficient increases with 
its absolute value, relatively small differences between large 
correlation coefficients can be significant. For example, a 
difference of .10 between two correlations may not be 
significant if the two coefficients are .15 and .25, although 
in the same sample, the same difference of .10 can be highly 
significant if the two coefficients are .80 and .90. 
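
One common form of such a test, for coefficients from two independent samples, is based on Fisher z values: the difference between the two z values is divided by its standard error, sqrt(1/(n1-3) + 1/(n2-3)), and referred to the normal distribution. The sketch below applies it to the two pairs of coefficients mentioned above; the sample size of 100 per group is our assumption, chosen only for illustration.

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed test of the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

z_small, p_small = compare_correlations(0.15, 100, 0.25, 100)
z_large, p_large = compare_correlations(0.80, 100, 0.90, 100)

print(f".15 vs .25: z = {z_small:.2f}, p = {p_small:.3f}")
print(f".80 vs .90: z = {z_large:.2f}, p = {p_large:.4f}")
```

With these sample sizes, the same raw difference of .10 is far from significant for the small coefficients but significant at the .01 level for the large ones.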

t-test for independent samples 

Purpose, Assumptions: The t-test is the most 
commonly used method to evaluate the differences in means 
between two groups. For example, the t-test can be used to 
test for a difference in test scores between a group of patients 
who were given a drug and a control group who received a 
placebo. Theoretically, the t-test can be used even if the 
sample sizes are very small (e.g., as small as 10; some 
researchers claim that even smaller n’s are possible), as 
long as the variables are normally distributed within each 
group and the variation of scores in the two groups is not 
reliably different. As mentioned before, the normality 
assumption can be evaluated by looking at the distribution 
of the data (via histograms) or by performing a normality 
test. The equality of variances assumption can be verified 
with the F test, or you can use the more robust Levene’s 
test. If these conditions are not met, then you can evaluate 
the differences in means between two groups using one of 
the nonparametric alternatives to the t-test. 


The p-level reported with a t-test represents the 
probability of error involved in accepting our research 
hypothesis about the existence of a difference. Technically 
speaking, this is the probability of error associated with 
rejecting the hypothesis of no difference between the two 
categories of observations (corresponding to the groups) in 
the population when, in fact, the hypothesis is true. Some 
researchers suggest that if the difference is in the predicted 
direction, you can consider only one half (one “tail”) of the 
probability distribution and thus divide the standard p-level 
reported with a t-test (a “two-tailed” probability) by two. 
Others, however, suggest that you should always report the 
standard, two-tailed t-test probability. 


Arrangement of Data: In order to perform the t-test 
for independent samples, one independent (grouping) 
variable (e.g., Gender: male/female) and at least one 
dependent variable (e.g., a test score) are required. The 
means of the dependent variable will be compared between 
selected groups based on the specified values (e.g., male 
and female) of the independent variable. The following data 
set can be analyzed with a t-test comparing the average 
WCC score in males and females. 

            GENDER     WCC 
case 1      male       111 
case 2      male 
case 3      male 
case 4      female 
case 5      female     104 

mean WCC in males = 110 
mean WCC in females = 103 

t-test graphs: In the t-test analysis, comparisons of means 
and measures of variation in the two groups can be 
visualized in box and whisker plots (for an example, see the 
graph below). 

These graphs help you to quickly evaluate and 
“intuitively visualize” the strength of the relation between 
the grouping and the dependent variable. 


More Complex Group Comparisons: It often happens 
in research practice that you need to compare more than 
two groups (e.g., drug 1, drug 2, and placebo), or compare 
groups created by more than one independent variable while 
controlling for the separate influence of each of them (e.g., 
Gender, type of Drug, and size of Dose). In these cases, you 
need to analyze the data using Analysis of Variance, which 
can be considered to be a generalization of the t-test. In 
fact, for two group comparisons, ANOVA will give results 
identical to a t-test (t**2 [df] = F[1,df]). However, when the 
design is more complex, ANOVA offers numerous advantages 
that t-tests cannot provide (even if you run a series of 
t-tests comparing various cells of the design). 
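
The identity between the squared t value and the F value for two groups can be verified numerically. The sketch below computes the pooled-variance t statistic and the one-way ANOVA F ratio from their textbook definitions; the two groups of scores are hypothetical.

```python
from statistics import fmean, variance

def pooled_t(a, b):
    """t statistic for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (fmean(a) - fmean(b)) / (pooled * (1.0 / na + 1.0 / nb)) ** 0.5

def one_way_f(groups):
    """F ratio of a one-way analysis of variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = fmean(all_values)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((v - fmean(g)) ** 2 for v in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

drug = [111, 110, 109, 113, 108]     # hypothetical scores
placebo = [102, 104, 106, 101, 105]  # hypothetical scores

t_value = pooled_t(drug, placebo)
f_value = one_way_f([drug, placebo])
print(f"t = {t_value:.3f}, t squared = {t_value ** 2:.3f}, F = {f_value:.3f}")
```

The same one_way_f function accepts three or more groups, which is where the analysis of variance goes beyond what a single t-test can do.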

t-test for dependent samples 

Within-group Variation: As explained in Elementary 
Concepts, the size of a relation between two variables, such 
as the one measured by a difference in means between two 
groups, depends to a large extent on the differentiation of 
values within the group. Depending on how differentiated 
the values are in each group, a given “raw difference” in 
group means will indicate either a stronger or weaker 
relationship between the independent (grouping) and 
dependent variable. For example, if the mean WCC (White 
Cell Count) was 102 in males and 104 in females, then this 
difference of “only” 2 points would be extremely important 
if all values for males fell within a range of 101 to 103, and 
all scores for females fell within a range of 103 to 105; for 
example, we would be able to predict WCC pretty well based 
on gender. However, if the same difference of 2 was obtained 
from very differentiated scores (e.g., if their range was 0- 
200), then we would consider the difference entirely 
negligible. That is to say, reduction of the within-group 
variation increases the sensitivity of our test. 

Purpose: The £-test for dependent samples helps us to 
take advantage of one specific type of design in which an 
important source of within-group variation (or so-called, 
error ) can be easily identified and excluded from the analysis. 
Specifically, if two groups of observations (that are to be 
compared) are based on the same sample of subjects who 
were tested twice (e.g., before and after a treatment), then a 
considerable part of the within-group variation in both 
groups of scores can be attributed to the initial individual 
differences between subjects. Note that, in a sense, this fact 
is not much different than in cases when the two groups 
are entirely independent (see t-test for independent samples), 
where individual differences also contribute to the error 
variance; but in the case of independent samples, we cannot 
do anything about it because we cannot identify (or 
“subtract”) the variation due to individual differences in 
subjects. 

However, if the same sample was tested twice, then we 
can easily identify (or “subtract”) this variation. Specifically, 
instead of treating each group separately, and analyzing 
raw scores, we can look only at the differences between the 
two measures (e.g., “pre-test” and “post-test”) in each subject. 
By subtracting the first score from the second for each subject 
and then analyzing only those “pure (paired) differences,” 
we will exclude the entire part of the variation in our data 
set that results from unequal base levels of individual 
subjects. This is precisely what is being done in the t-test 
for dependent samples, and, as compared to the t-test for 
independent samples, it usually produces “better” results 
(i.e., it is more sensitive whenever the two measurements 
are positively correlated across subjects). 

Assumptions: The theoretical assumptions of the t-test 
for independent samples also apply to the dependent samples 
test; that is, the paired differences should be normally 
distributed. If these assumptions are clearly not met, then 
one of the nonparametric alternative tests should be used. 

Arrangement of Data: Technically, we can apply the 
t-test for dependent samples to any two variables in our 
data set. However, applying this test will make very little 
sense if the values of the two variables in the data set are 
not logically and methodologically comparable. For example, 
if you compare the average WCC in a sample of patients 
before and after a treatment, but using a different counting 
method or different units in the second measurement, then 
a highly significant t-test value could be obtained due to an 
artifact; that is, to the change of units of measurement. 
Following is an example of a data set that can be analyzed 
using the t-test for dependent samples. 



          WCC       WCC
          before    after
case 1    111.9     113
case 2    109       110
case 3    143       144
case 4    101       102
case 5    80        80.9

average change between WCC “before” and “after” = 1


The average difference between the two conditions is 
relatively small (d=1) as compared to the differentiation 
(range) of the raw scores (from 80 to 143, in the first sample). 
However, the t-test for dependent samples analysis is 
performed only on the paired differences, “ignoring” the raw 
scores and their potential differentiation. Thus, the size of 
this particular difference of 1 will be compared not to the 
differentiation of raw scores but to the differentiation of the 
individual difference scores, which is relatively small: 0.2 
(from 0.9 to 1.1). Compared to that variability, the difference 
of 1 is extremely large and can yield a highly significant t 
value. 
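
The contrast between the two analyses can be made concrete with a short calculation. The sketch below uses only the Python standard library and the five cases from the example, taking the case 2 “before” value as 109 (the value consistent with the stated average change of 1). The variable names are our own.

```python
from statistics import mean, stdev

# WCC values from the five example cases (case 2 "before" taken as 109).
before = [111.9, 109, 143, 101, 80]
after = [113, 110, 144, 102, 80.9]

# Dependent-samples t: only the paired differences are analyzed.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
t_paired = mean(diffs) / (stdev(diffs) / n ** 0.5)        # df = n - 1

# For contrast: the same numbers treated as two independent groups.
pooled_var = (stdev(before) ** 2 + stdev(after) ** 2) / 2  # valid for equal n
t_indep = (mean(after) - mean(before)) / (pooled_var * 2 / n) ** 0.5  # df = 2n - 2

print(f"mean difference = {mean(diffs):.2f}")
print(f"paired t = {t_paired:.2f}, independent-samples t = {t_indep:.2f}")
```

The same mean difference of 1 gives a very large t when it is compared to the variability of the difference scores, and a negligible t when it is compared to the variability of the raw scores.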

Matrices of t-tests: t-tests for dependent samples can 
be calculated for long lists of variables, and reviewed in the 
form of matrices produced with casewise or pairwise deletion 
of missing data, much like the correlation matrices. Thus, 
the precautions discussed in the context of correlations also 
apply to t-test matrices; see: the issue of artifacts caused by 
the pairwise deletion of missing data in t-tests and the 
issue of “randomly” significant test values. 

More Complex Group Comparisons: If there are 
more than two “correlated samples” (e.g., before treatment, 
after treatment 1, and after treatment 2), then analysis of 
variance with repeated measures should be used. The 
repeated measures ANOVA can be considered a 
generalization of the t-test for dependent samples and it 
offers various features that increase the overall sensitivity 
of the analysis. For example, it can simultaneously control 
not only for the base level of the dependent variable but 
also for other factors, and/or include in the design 
more than one interrelated dependent variable. 

Breakdown: Descriptive Statistics by Groups 

Purpose: The breakdowns analysis calculates descriptive 
statistics and correlations for dependent variables 
in each of a number of groups defined by one or more 
grouping (independent) variables. 

Arrangement of Data: In the following example data 
set (spreadsheet), the dependent variable WCC (White Cell 
Count) can be broken down by 2 independent variables: 
Gender (values: males and females), and Height (values: 
tall and short). 



          GENDER    HEIGHT    WCC
case 1    male      short     101
case 2    male      tall      110
case 3    male      tall      92
case 4    female    tall      112
case 5    female    short     95
...       ...       ...       ...


The resulting breakdowns might look as follows (we are 
assuming that Gender was specified as the first independent 
variable, and Height as the second). 

                         Entire sample
                         Mean=100
                         SD=13
                         N=120

          Males                           Females
          Mean=99                         Mean=101
          SD=13                           SD=13
          N=60                            N=60

Tall/males    Short/males    Tall/females    Short/females
Mean=98       Mean=100       Mean=101        Mean=101
SD=13         SD=13          SD=13           SD=13
N=30          N=30           N=30            N=30


The composition of the “intermediate” level cells of the 
“breakdown tree” depends on the order in which independent 
variables are arranged. For example, in the above example, 
you see the means for “all males” and “all females” but you do 
not see the means for “all tall subjects” and “all short subjects” 
which would have been produced had you specified 
independent variable Height as the first grouping variable 
rather than the second. 
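
The effect of the ordering of grouping variables can be sketched in a few lines. The example below uses only the Python standard library and the five listed cases (reading the WCC of case 2 as 110); the `breakdown` helper is our own illustration and is not the output of any particular statistics package.

```python
from collections import defaultdict
from statistics import mean

# The five cases listed in the example data set: (GENDER, HEIGHT, WCC).
cases = [
    ("male", "short", 101),
    ("male", "tall", 110),
    ("male", "tall", 92),
    ("female", "tall", 112),
    ("female", "short", 95),
]

def breakdown(rows, key):
    """Group WCC values by key(row) and return {group: (mean, n)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[2])
    return {g: (mean(v), len(v)) for g, v in groups.items()}

by_gender = breakdown(cases, lambda r: r[0])        # intermediate level, Gender first
by_height = breakdown(cases, lambda r: r[1])        # intermediate level, Height first
by_cell = breakdown(cases, lambda r: (r[0], r[1]))  # lowest level, same either way

print(by_gender)
print(by_height)
print(by_cell)
```

Only the intermediate level changes with the order of the grouping variables; the lowest-level cells are identical in both arrangements.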

Statistical Tests in Breakdowns: Breakdowns are 
typically used as an exploratory data analysis technique; 
the typical question that this technique can help answer is 
very simple: Are the groups created by the independent 
variables different regarding the dependent variable? If you 
are interested in differences concerning the means, then 
the appropriate test is the breakdowns one-way ANOVA (F 
test). If you are interested in variation differences, then you 
should test for homogeneity of variances. 

Other Related Data Analysis Techniques: Although 
for exploratory data analysis, breakdowns can use more 
than one independent variable, the statistical procedures in 
breakdowns assume the existence of a single grouping factor. 
Thus, those statistics do not reveal or even take into account 
any possible interactions between grouping variables in the 
design. For example, there could be differences between the 
influences of one independent variable on the dependent 
variable at different levels of another independent variable 
(e.g., tall people could have lower WCC than short ones, but 
only if they are males; see the “tree” data above). You can 
explore such effects by examining breakdowns “visually,” 
using different orders of independent variables, but the 
magnitude or significance of such effects cannot be estimated 
by the breakdown statistics. 


Post-Hoc Comparisons of Means: Usually, after 
obtaining a statistically significant F test from the ANOVA, 
one wants to know which of the means contributed to the 
effect (i.e., which groups are particularly different from each 
other). One could of course perform a series of simple t-tests 
to compare all possible pairs of means. However, such 
a procedure would capitalize on chance. This means that 
the reported probability levels would actually overestimate 
the statistical significance of mean differences. Without going 
into too much detail, suppose you took 20 samples of 10 
random numbers each, and computed 20 means. Then, take 
the group (sample) with the highest mean and compare it 
with that of the lowest mean. The t-test for independent 
samples will test whether or not those two means are 
significantly different from each other, provided they were 
the only two samples taken. Post-hoc comparison techniques 
on the other hand specifically take into account the fact 
that more than two samples were taken. 
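
The thought experiment above can be run as a small simulation. The sketch below uses only the Python standard library. The seed, the number of replications, and the choice of a standard normal population are our own assumptions; 2.101 is the two-tailed 5% critical value of t for 18 degrees of freedom.

```python
import random
from statistics import fmean, variance

random.seed(1)     # fixed seed so the run is repeatable
T_CRIT = 2.101     # two-tailed 5% critical value of t for df = 18

def extreme_pair_t(n_samples=20, n_per_sample=10):
    """Draw all samples from ONE population, then t-test highest vs. lowest mean."""
    samples = [[random.gauss(0, 1) for _ in range(n_per_sample)]
               for _ in range(n_samples)]
    hi = max(samples, key=fmean)
    lo = min(samples, key=fmean)
    pooled = (variance(hi) + variance(lo)) / 2     # equal n in both samples
    return (fmean(hi) - fmean(lo)) / (pooled * 2 / n_per_sample) ** 0.5

replications = 500
hits = sum(extreme_pair_t() > T_CRIT for _ in range(replications))
false_positive_rate = hits / replications
print(f"'significant' at the nominal 5% level in {false_positive_rate:.0%} of runs")
```

Because every sample comes from the same population, each “significant” result here is a false positive; the rate is far above the nominal 5 percent, which is the problem that post-hoc procedures are designed to correct.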


Breakdowns vs. Discriminant Function Analysis: 
Breakdowns can be considered as a first step toward another 
type of analysis that explores differences between groups: 
Discriminant function analysis. Similar to breakdowns, 
discriminant function analysis explores the differences 
between groups created by values (group codes) of an 
independent (grouping) variable. However, unlike 
breakdowns, discriminant function analysis simultaneously 
analyzes more than one dependent variable and it identifies 
“patterns” of values of those dependent variables. 
Technically, it determines a linear combination of the 
dependent variables that best predicts the group 
membership. For example, discriminant function analysis 
can be used to analyze differences between three groups of 
persons who have chosen different professions (e.g., lawyers, 
physicians, and engineers) in terms of various aspects of 
their scholastic performance in high school. One could claim 
that such analysis could “explain” the choice of a profession 
in terms of specific talents shown in high school; thus, 
discriminant function analysis can be considered to be an 
“exploratory extension” of simple breakdowns. 




Breakdowns vs. Frequency Tables: Another related 
type of analysis that cannot be directly performed with 
breakdowns is comparisons of frequencies of cases (n’s) 
between groups. Specifically, often the n’s in individual cells 
are not equal because the assignment of subjects to those 
groups typically results not from an experimenter’s 
manipulation, but from subjects’ pre-existing dispositions. 
If, in spite of the random selection of the entire sample, the 
n’s are unequal, then it may suggest that the independent 
variables are related. For example, crosstabulating levels of 
independent variables Age and Education most likely would 
not create groups of equal n, because education is distributed 
differently in different age groups. If you are interested in 
such comparisons, you can explore specific frequencies in 
the breakdowns tables, trying different orders of independent 
variables. However, in order to subject such differences to 
statistical tests, you should use crosstabulations and 
frequency tables, Log-Linear Analysis, or Correspondence 
Analysis (for more advanced analyses on multi-way 
frequency tables). 


Graphical breakdowns: Graphs can often identify 
effects (both expected and unexpected) in the data more 
quickly and sometimes “better” than any other data analysis 
method. Categorized graphs allow you to plot the means, 
distributions, correlations, etc. across the groups of a given 
table (e.g., categorized histograms, categorized probability 
plots, categorized box and whisker plots). The graph below 
shows a categorized histogram, which enables you to quickly 
evaluate and visualize the shape of the data for each group 
(group1-female, group2-female, etc.). 



The categorized scatterplot (in the graph below) shows 
the differences between patterns of correlations between 
dependent variables across the groups. 

Additionally, if the software has a brushing facility, 
which supports animated brushing, you can select (i.e., 
highlight) in a matrix scatterplot all data points that belong 
to a certain category in order to examine how those specific 
observations contribute to relations between other variables 
in the same data set. 



FREQUENCY TABLES 

Purpose: Frequency or one-way tables represent the 
simplest method for analyzing categorical data. They are 
often used as one of the exploratory procedures to review 
how different categories of values are distributed in the 
sample. For example, in a survey of spectator interest in 
different sports, we could summarize the respondents’ 
interest in watching football in a frequency table as follows: 

The table shows the number, proportion, and cumulative 
proportion of respondents who characterized their interest 
in watching football as (1) Always interested , (2) Usually 
interested , (3) Sometimes interested , or (4) Never interested. 

STATISTICA BASIC STATS    FOOTBALL: “Watching football”

                                          Cumulatv              Cumulatv
Category                          Count   Count      Percent    Percent
ALWAYS: Always interested         39      39         39.00000   39.0000
USUALLY: Usually interested       16      55         16.00000   55.0000
SOMETIMS: Sometimes interested    26      81         26.00000   81.0000
NEVER: Never interested           19      100        19.00000   100.0000
Missing                           0       100        0.00000    100.0000
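
The columns of such a table follow directly from the raw counts. A minimal sketch in Python (standard library only), using the four category counts from the football example; the short column labels are our own.

```python
from itertools import accumulate

# Counts from the football-interest example.
counts = {"ALWAYS": 39, "USUALLY": 16, "SOMETIMS": 26, "NEVER": 19}

total = sum(counts.values())
cum_counts = list(accumulate(counts.values()))   # running totals

print(f"{'Category':<10}{'Count':>7}{'CumCount':>10}{'Percent':>9}{'CumPct':>8}")
for (category, count), cum in zip(counts.items(), cum_counts):
    print(f"{category:<10}{count:>7}{cum:>10}"
          f"{100 * count / total:>9.1f}{100 * cum / total:>8.1f}")
```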


Applications: In practically every research project, a 
first look at the data usually includes frequency tables. 
For example, in survey research, frequency tables can show 
the number of males and females who participated in the 
survey, the number of respondents from particular ethnic 
and racial backgrounds, and so on. Responses on some 
labeled attitude measurement scales (e.g., interest in 
watching football) can also be nicely summarized via the 
frequency table. In medical research, one may tabulate the 
number of patients displaying specific symptoms; in 
industrial research one may tabulate the frequency of 
different causes leading to catastrophic failure of products 
during stress tests (e.g., which parts are actually responsible 
for the complete malfunction of television sets under extreme 
temperatures?). Customarily, if a data set includes any 
categorical data, then one of the first steps in the data 
analysis is to compute a frequency table for those categorical 
variables. 

Crosstabulation and Stub-and-Banner Tables 

Purpose and Arrangement of Table: Crosstabulation 
is a combination of two (or more) frequency tables arranged 
such that each cell in the resulting table represents a unique 
combination of specific values of crosstabulated variables. 
Thus, crosstabulation allows us to examine frequencies of 
observations that belong to specific categories on more than 
one variable. By examining these frequencies, we can 
identify relations between crosstabulated variables. Only 
categorical variables or variables with a relatively small 
number of different meaningful values should be 
crosstabulated. Note that in the cases where we do want to 
include a continuous variable in a crosstabulation (e.g., 
income), we can first recode it into a particular number of 
distinct ranges (e.g., low, medium, high). 

2x2 Table: The simplest form of crosstabulation is the 
2 by 2 table where two variables are “crossed,” and each 
variable has only two distinct values. For example, suppose 
we conduct a simple study in which males and females are 
asked to choose one of two different brands of soda pop 
(brand A and brand B); the data file can be arranged like 
this: 



          GENDER    SODA
case 1    MALE      A
case 2    FEMALE    B
case 3    FEMALE    B
case 4    FEMALE    A
case 5    MALE      B
...       ...       ...


The resulting crosstabulation could look as follows. 



                  SODA: A     SODA: B
GENDER: MALE      20 (40%)    30 (60%)    50 (50%)
GENDER: FEMALE    30 (60%)    20 (40%)    50 (50%)
                  50 (50%)    50 (50%)    100 (100%)


Each cell represents a unique combination of values of 
two crosstabulated variables (row variable Gender and 
column variable Soda), and the numbers in each cell tell us 
how many observations fall into each combination of values. 
In general, this table shows us that more females than 
males chose the soda pop brand A, and that more males 
than females chose soda B. Thus, gender and preference for 
a particular brand of soda may be related (later we will see 
how this relationship can be measured). 

Marginal Frequencies: The values in the margins of 
the table are simply one-way (frequency) tables for all values 
in the table. They are important in that they help us to 
evaluate the arrangement of frequencies in individual 
columns or rows. For example, the frequencies of 40% and 
60% of males and females (respectively) who chose soda A 
(see the first column of the above table), would not indicate 
any relationship between Gender and Soda if the marginal 
frequencies for Gender were also 40% and 60%; in that case 
they would simply reflect the different proportions of males 
and females in the study. Thus, the differences between the 
distributions of frequencies in individual rows (or columns) 
and in the respective margins inform us about the 
relationship between the crosstabulated variables. 

Column, Row and Total Percentages: The example 
in the previous paragraph demonstrates that in order to 
evaluate relationships between crosstabulated variables, we 
need to compare the proportions of marginal and individual 
column or row frequencies. Such comparisons are easiest to 
perform when the frequencies are presented as percentages. 
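
The three kinds of percentages can be derived from the cell counts as follows. This is a minimal sketch in Python (standard library only) using the Gender by Soda counts from the example; the dictionary layout and variable names are our own.

```python
# Cell counts from the Gender x Soda example: table[gender][soda].
table = {
    "MALE":   {"A": 20, "B": 30},
    "FEMALE": {"A": 30, "B": 20},
}

row_totals = {g: sum(cells.values()) for g, cells in table.items()}
col_totals = {s: sum(table[g][s] for g in table) for s in ("A", "B")}
n = sum(row_totals.values())

# Each cell expressed relative to its row, its column, and the whole table.
row_pct = {g: {s: 100 * c / row_totals[g] for s, c in cells.items()}
           for g, cells in table.items()}
col_pct = {g: {s: 100 * c / col_totals[s] for s, c in cells.items()}
           for g, cells in table.items()}
total_pct = {g: {s: 100 * c / n for s, c in cells.items()}
             for g, cells in table.items()}

# Marginal distribution of Gender, to compare against the column percentages.
marginal_gender_pct = {g: 100 * t / n for g, t in row_totals.items()}

print(col_pct["MALE"]["A"], col_pct["FEMALE"]["A"])  # gender split among brand-A choosers
print(marginal_gender_pct)                           # gender split in the whole sample
```

A 40/60 split among those who chose brand A, against a 50/50 split in the margins, is the kind of discrepancy that points to a relationship between the two variables.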

Graphical Representations of Crosstabulations: For 
analytic purposes, the individual rows or columns of a table 
can be represented as column graphs. However, often it is 
useful to visualize the entire table in a single graph. A two- 
way table can be visualized in a 3-dimensional histogram; 
alternatively, a categorized histogram can be produced, 
where one variable is represented by individual histograms, 
which are drawn at each level (category) of the other variable 
in the crosstabulation. The advantage of the 3D histogram 
is that it produces an integrated picture of the entire table; 
the advantage of the categorized graph is that it allows us 
to precisely evaluate specific frequencies in each cell of the 
table. 


Stub-and-Banner Tables: Stub-and-Banner tables, or 
Banners for short, are a way to display several two-way 
tables in a compressed form. This type of table is most 
easily explained with an example. Let us return to the survey 
of sports spectators example. (Note that, in order to simplify 
matters, only the response categories Always and Usually 
were tabulated in the table below.) 


STATISTICA BASIC STATS    Stub-and-Banner Table: Row Percent

                     FOOTBALL    FOOTBALL    Row
Factor               ALWAYS      USUALLY     Total
BASEBALL: ALWAYS     92.31       7.69        66.67
BASEBALL: USUALLY    61.54       38.46       33.33
BASEBALL: Total      82.05       17.95       100.00
TENNIS: ALWAYS       87.50       12.50       66.67
TENNIS: USUALLY      87.50       12.50       33.33
TENNIS: Total        87.50       12.50       100.00
BOXING: ALWAYS       77.78       22.22       52.94
BOXING: USUALLY      100.00      0.00        47.06
BOXING: Total        88.24       11.76       100.00


Interpreting the Banner Table: In the table above, 
we see the two-way tables of expressed interest in Football 
by expressed interest in Baseball, Tennis, and Boxing. The 
table entries represent percentages of rows, so that the 
percentages across columns will add up to 100 percent. For 
example, the number in the upper left hand corner of the 
Scrollsheet (92.31) shows that 92.31 percent of all 
respondents who said they were always interested in watching 
baseball also said that they were always interested in 
watching football. Further down we can see that the percent 
of those always interested in watching tennis who were 
also always interested in watching football was 87.50 percent; 
for boxing this number is 77.78 percent. The percentages in 
the last column (Row Total) are always relative to the total 
number of cases. 

Multi-way Tables with Control Variables: When only 
two variables are crosstabulated, we call the resulting table 
a two-way table. However, the general idea of crosstabulating 
values of variables can be generalized to more than just two 
variables. For example, to return to the “soda” example 
presented earlier (see above), a third variable could be added 
to the data set. This variable might contain information 
about the state in which the study was conducted (either 
Nebraska or New York). 



          GENDER    SODA    STATE
case 1    MALE      A       NEBRASKA
case 2    FEMALE    B       NEWYORK
case 3    FEMALE    B       NEBRASKA
case 4    FEMALE    A       NEBRASKA
case 5    MALE      B       NEWYORK
...       ...       ...     ...



The crosstabulation of these variables would result in a 
3-way table: 



            STATE: NEWYORK                  STATE: NEBRASKA
            SODA: A    SODA: B              SODA: A    SODA: B
G:MALE      20         30         50        5          45         50
G:FEMALE    30         20         50        45         5          50
            50         50         100       50         50         100
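
One way to read a 3-way table is to compare the two-way relationship within each level of the control variable, and then to see what happens when the control variable is ignored. A minimal sketch in Python (standard library only), using the counts from the table above; the data layout and the helper name are our own.

```python
# Counts from the 3-way table: counts[state][gender] = (soda A, soda B).
counts = {
    "NEWYORK":  {"MALE": (20, 30), "FEMALE": (30, 20)},
    "NEBRASKA": {"MALE": (5, 45),  "FEMALE": (45, 5)},
}

def pct_choosing_a(state, gender):
    """Percent of one gender choosing brand A within one state."""
    a, b = counts[state][gender]
    return 100 * a / (a + b)

for state in counts:
    gap = pct_choosing_a(state, "FEMALE") - pct_choosing_a(state, "MALE")
    print(f"{state}: female minus male percent choosing A = {gap:.0f} points")

# Collapsing over STATE gives back a single 2-way Gender x Soda table.
collapsed = {g: tuple(sum(counts[s][g][i] for s in counts) for i in (0, 1))
             for g in ("MALE", "FEMALE")}
print(collapsed)
```

In these data the gender difference has the same direction in both states but is much stronger in Nebraska, a detail that the collapsed table cannot show.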


Theoretically, an unlimited number of variables can be 
crosstabulated in a single multi-way table. However, 
research practice shows that it is usually difficult to examine 
and “understand” tables that involve more than 4 variables. 
It is recommended to analyze relationships between the 
factors in such tables using modeling techniques such as 
Log-Linear Analysis or Correspondence Analysis. 

Graphical Representations of Multi-way Tables: 

You can produce “double categorized” histograms, 3D 
histograms, or line-plots that will summarize the frequencies 
for up to 3 factors in a single graph. 

[Figure: Bivariate Distribution: FOOTBALL x BASEBALL]

[Figure: BASEBALL x BOXING]

Batches (cascades) of graphs can be used to summarize 
higher-way tables (as shown in the graph below). 

Statistics in Crosstabulation Tables 

General Introduction: Crosstabulations generally 
allow us to identify relationships between the crosstabulated 
variables. The following table illustrates an example of a 
very strong relationship between two variables: variable 
Age (Adult vs. Child) and variable Cookie preference 
(A vs. B). 



              COOKIE: A    COOKIE: B
AGE: ADULT    50           0            50
AGE: CHILD    0            50           50
              50           50           100


All adults chose cookie A, while all children chose cookie 
B. In this case there is little doubt about the reliability of 
the finding, because it is hardly conceivable that one would 
obtain such a pattern of frequencies by chance alone; that 
is, without the existence of a “true” difference between the 
cookie preferences of adults and children. However, in real- 
life, relations between variables are typically much weaker, 
and thus the question arises as to how to measure those 
relationships, and how to evaluate their reliability (statistical 
significance). The following review includes the most common 
measures of relationships between two categorical variables; 
that is, measures for two-way tables. The techniques used 
to analyze simultaneous relations between more than two 
variables in higher-order crosstabulations are discussed in 
the context of the Log-Linear Analysis module and the 
Correspondence Analysis. 

Pearson Chi-square: The Pearson Chi-square is the 
most common test for significance of the relationship between 
categorical variables. This measure is based on the fact 
that we can compute the expected frequencies in a two-way 
table (i.e., frequencies that we would expect if there was no 
relationship between the variables). For example, suppose 
we ask 20 males and 20 females to choose between two 
brands of soda pop (brands A and B). If there is no 
relationship between preference and gender, then we would 
expect about an equal number of choices of brand A and 
brand B for each sex. The Chi-square test becomes 
increasingly significant as the numbers deviate further from 
this expected pattern; that is, the more this pattern of choices 
for males and females differs. 

The value of the Chi-square and its significance level 
depend on the overall number of observations and the 
number of cells in the table. Consistent with the principles 
discussed in Elementary Concepts , relatively small deviations 
of the relative frequencies across cells from the expected 
pattern will prove significant if the number of observations 
is large. 
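
The computation of expected frequencies and of the Pearson Chi-square can be sketched as follows. The example uses only the Python standard library and, as an assumption for illustration, the Gender by Soda counts from the earlier 2 by 2 table. The closed-form p-value used here is valid only for one degree of freedom, that is, for a 2 by 2 table.

```python
import math

observed = [[20, 30],    # MALE:   soda A, soda B
            [30, 20]]    # FEMALE: soda A, soda B

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum((o - e) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

# For df = 1 only, the upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_square / 2))

print("expected:", expected)
print(f"Chi-square = {chi_square:.2f}, df = 1, p = {p_value:.4f}")
```

Multiplying every observed count by 10 leaves the percentages unchanged but multiplies the Chi-square by 10, which illustrates the dependence on the number of observations described above.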


The only assumption underlying the use of the Chi- 
square (other than random selection of the sample) is that 
the expected frequencies are not very small. The reason for 
this is that, actually, the Chi-square inherently tests the 
underlying probabilities in each cell; and when the expected 
cell frequencies fall, for example, below 5, those probabilities 
cannot be estimated with sufficient precision. 

Maximum-Likelihood Chi-square: The Maximum-Likelihood 
Chi-square tests the same hypothesis as the 
Pearson Chi-square statistic; however, its computation is 
based on Maximum-Likelihood theory. In practice, the M-L 
Chi-square is usually very close in magnitude to the Pearson 
Chi-square statistic. For more details about this statistic 
refer to Bishop, Fienberg, and Holland (1975), or Fienberg, 
S. E. (1977); the Log-Linear Analysis chapter of the manual 
also discusses this statistic in greater detail. 
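
As a rough illustration of how close the two statistics usually are, the sketch below computes both for the same 2 by 2 table. It uses only the Python standard library; the counts are the Gender by Soda values used earlier in the chapter, and the formula for the M-L statistic, 2 times the sum of O * ln(O/E), is the standard likelihood-ratio form rather than a description of any particular program's output.

```python
import math

observed = [[20, 30], [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# (observed, expected) pairs for every cell of the table.
cells = [(observed[i][j], row_totals[i] * col_totals[j] / n)
         for i in range(2) for j in range(2)]

pearson = sum((o - e) ** 2 / e for o, e in cells)

# Likelihood-ratio form; empty cells contribute nothing to the sum.
ml_chi_square = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)

print(f"Pearson = {pearson:.3f}, M-L = {ml_chi_square:.3f}")
```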

Yates Correction: The approximation of the Chi-square 
statistic in small 2x2 tables can be improved by reducing 
the absolute value of differences between expected and 
observed frequencies by 0.5 before squaring (Yates’ 
correction). This correction, which makes the estimation more 
conservative, is usually applied when the table contains 
only small observed frequencies, so that some expected 
frequencies become less than 10. 
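
The effect of the correction is easy to see numerically. A minimal sketch in Python (standard library only); the counts are again the Gender by Soda values and serve only to show the arithmetic, since with expected frequencies of 25 the correction would not normally be needed.

```python
observed = [[20, 30], [30, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

def chi_square(correction=0.0):
    """Pearson Chi-square; correction=0.5 gives the Yates-corrected value."""
    total = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            # Shrink each |O - E| by the correction, never below zero.
            total += max(0.0, abs(o - e) - correction) ** 2 / e
    return total

uncorrected = chi_square()
yates = chi_square(0.5)
print(f"uncorrected = {uncorrected:.2f}, Yates-corrected = {yates:.2f}")
```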

Fisher Exact Test: This test is only available for 2x2 
tables; it is based on the following rationale: Given the 
marginal frequencies in the table, and assuming that in the 
population the two factors in the table are not related, how 
likely is it to obtain cell frequencies as uneven or worse 
than the ones that were observed? For small n, this 
probability can be computed exactly by counting all possible 
tables that can be constructed based on the marginal 
frequencies. Thus, the Fisher exact test computes the exact 
probability under the null hypothesis of obtaining the current 
distribution of frequencies across cells, or one that is more 
uneven. 
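
The counting argument can be written out directly. The sketch below uses only the Python standard library; the small table is hypothetical, and “more uneven” is interpreted here as “no more probable than the observed table”, which is one common two-sided convention and an assumption on our part.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the upper-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every possible table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))

p = fisher_exact_two_sided(3, 1, 1, 3)   # hypothetical table with n = 8
print(f"exact two-sided p = {p:.4f}")
```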

McNemar Chi-square: This test is applicable in 
situations where the frequencies in the 2x2 table represent 
dependent samples. For example, in a before-after design 
study, we may count the number of students who fail a test 
of minimal math skills at the beginning of the semester 
and at the end of the semester. Two Chi-square values are 
reported: A/D and B/C. The Chi-square A/D tests the 
hypothesis that the frequencies in cells A and D (upper left, 
lower right) are identical. The Chi-square B/C tests the 
hypothesis that the frequencies in cells B and C (upper 
right, lower left) are identical. 
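
A minimal sketch of the two values in Python follows. The counts are hypothetical, and the uncorrected form (x - y)**2 / (x + y) is assumed; some programs subtract a continuity correction before squaring, so reported values may differ slightly.

```python
def mcnemar(a, b, c, d):
    """Return (Chi-square A/D, Chi-square B/C) for the 2x2 table [[a, b], [c, d]]."""
    chi_ad = (a - d) ** 2 / (a + d)   # are cells A and D equally frequent?
    chi_bc = (b - c) ** 2 / (b + c)   # are cells B and C equally frequent?
    return chi_ad, chi_bc

# Hypothetical before-after counts for 50 students:
#                 after: fail   after: pass
# before: fail        a = 10       b = 15
# before: pass        c = 5        d = 20
chi_ad, chi_bc = mcnemar(10, 15, 5, 20)
print(f"Chi-square A/D = {chi_ad:.3f}, Chi-square B/C = {chi_bc:.3f}")
```

In a before-after design, cells B and C hold the students whose status changed, so the B/C value is the one that addresses whether the failure rate differs between the two occasions.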

Coefficient Phi: The Phi-square is a measure of 
correlation between two categorical variables in a 2 x 2 
table. Its value can range from 0 (no relation between factors; 
Chi-square = 0.0) to 1 (perfect relation between the two 
factors in the table). 

Tetrachoric Correlation: This statistic is also only 
computed for (applicable to) 2 x 2 tables. If the 2 x 2 table 
can be thought of as the result of two continuous variables 
that were (artificially) forced into two categories each, then 
the tetrachoric correlation coefficient will estimate the 
correlation between the two. 

Coefficient of Contingency: The coefficient of 
contingency is a Chi-square based measure of the relation 
between two categorical variables (proposed by Pearson, the 
originator of the Chi-square test). Its advantage over the 
ordinary Chi-square is that it is more easily interpreted, 
since its range is always limited to 0 through 1 (where 0 
means complete independence). The disadvantage of this 
statistic is that its specific upper limit is “limited” by the 
size of the table; C can reach the limit of 1 only if the 
number of categories is unlimited. 
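
Both the Phi-square and the coefficient of contingency are simple functions of the Chi-square and the number of observations. The sketch below (Python, standard library only) uses the usual textbook definitions, Phi-square = Chi-square / N and C = sqrt(Chi-square / (Chi-square + N)), applied to the soda and cookie tables from this chapter; the function names are our own.

```python
def chi_square(table):
    """Pearson Chi-square for a two-way table given as a list of rows."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return sum((table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i in range(len(rows)) for j in range(len(cols)))

def phi_square(table):
    return chi_square(table) / sum(map(sum, table))

def contingency_c(table):
    x2 = chi_square(table)
    return (x2 / (x2 + sum(map(sum, table)))) ** 0.5

soda = [[20, 30], [30, 20]]     # weak relation
cookie = [[50, 0], [0, 50]]     # perfect relation

for name, table in (("soda", soda), ("cookie", cookie)):
    print(f"{name}: Phi-square = {phi_square(table):.3f}, C = {contingency_c(table):.3f}")
```

Even for the perfectly related cookie table, C stays well below 1 in a 2 x 2 table, which is the limitation described above.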

Interpretation of Contingency Measures: An 
important disadvantage of measures of contingency 
(reviewed above) is that they do not lend themselves to 
clear interpretations in terms of probability or “proportion 
of variance,” as is the case, for example, of the Pearson r. 
There is no commonly accepted measure of relation between 
categories that has such a clear interpretation. 

Statistics Based on Ranks: In many cases the 
categories used in the crosstabulation contain meaningful 
rank-ordering information; that is, they measure some 
characteristic on an ordinal scale. Suppose we asked a 
sample of respondents to indicate their interest in watching 
different sports on a 4-point scale with the explicit labels 
(1) always, (2) usually, (3) sometimes, and (4) never 
interested. Obviously, we can assume that the response 
sometimes interested is indicative of less interest than always 
interested, and so on. Thus, we could rank the respondents 
with regard to their expressed interest in, for example, 
watching football. When categorical variables can be 
interpreted in this manner, there are several additional 
indices that can be computed to express the relationship 
between variables. 

Spearman R: Spearman R can be thought of as the 
regular Pearson product-moment correlation coefficient 
(Pearson r); that is, in terms of the proportion of variability 
accounted for, except that Spearman R is computed from 
ranks. As mentioned above, Spearman R assumes that the 
variables under consideration were measured on at least an 
ordinal (rank order) scale; that is, the individual 
observations (cases) can be ranked into two ordered series. 
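
The description “Pearson r computed from ranks” translates directly into code. A minimal sketch in Python (standard library only); the ratings of six respondents are hypothetical, and tied values receive the average of the ranks they span, which is the usual convention.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                         # extend the run of tied values
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_r(x, y):
    """Spearman R: Pearson r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical interest ratings (1 = always ... 4 = never) from six respondents.
football = [1, 2, 2, 3, 4, 4]
baseball = [2, 1, 2, 3, 3, 4]
r = spearman_r(football, baseball)
print(f"Spearman R = {r:.3f}")
```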

Kendall tau: Kendall tau is equivalent to the Spearman 
R statistic with regard to the underlying assumptions. It is 
also comparable in terms of its statistical power. However, 
Spearman R and Kendall tau are usually not identical in 
magnitude because their underlying logic, as well as their 
computational formulas are very different. Siegel and 
Castellan (1988) express the relationship of the two 
measures in terms of the inequality: 

-1 <= 3 * Kendall tau - 2 * Spearman R <= 1 

More importantly, Kendall tau and Spearman R imply 
different interpretations: While Spearman R can be thought 
of as the regular Pearson product-moment correlation 
coefficient as computed from ranks, Kendall tau rather 
represents a probability. Specifically, it is the difference 
between the probability that the observed data are in the 
same order for the two variables and the probability that 
the observed data are in different orders for the two 
variables. Kendall (1948, 1975), Everitt (1977), and Siegel 
and Castellan (1988) discuss Kendall tau in greater detail. 
Two different variants of tau are computed, usually called 
tau b and tau c. These measures differ only with regard to 
how tied ranks are handled. In most cases these values will 
be fairly similar, and when discrepancies occur, it is probably 
safest to interpret the lower value. 

Somers’ d: d(X|Y), d(Y|X). 

Somers’ d is an asymmetric measure of association 
related to tau b. 

Gamma: The Gamma statistic is preferable to Spearman 
R or Kendall tau when the data contain many tied 
observations. In terms of the underlying assumptions, 
Gamma is equivalent to Spearman R or Kendall tau; in 
terms of its interpretation and computation, it is more 
similar to Kendall tau than Spearman R. In short, Gamma 
is also a probability; specifically, it is computed as the 
difference between the probability that the rank ordering of 
the two variables agree minus the probability that they 
disagree, divided by 1 minus the probability of ties. Thus, 
Gamma is basically equivalent to Kendall tau, except that 
ties are explicitly taken into account. 
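The computation of Gamma described above can be sketched as follows. The data are hypothetical and deliberately contain many ties; `pair_counts` and `gamma` are illustrative names, not functions of any specific library.

```python
from itertools import combinations


def pair_counts(x, y):
    """Count concordant, discordant, and tied pairs of observations."""
    concordant = discordant = tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        else:
            tied += 1            # tied on x, on y, or on both
    return concordant, discordant, tied


def gamma(x, y):
    """(P(agree) - P(disagree)) / (1 - P(ties)), which equals (C - D) / (C + D)."""
    concordant, discordant, tied = pair_counts(x, y)
    total = concordant + discordant + tied
    p_agree = concordant / total
    p_disagree = discordant / total
    p_ties = tied / total
    return (p_agree - p_disagree) / (1 - p_ties)


# Hypothetical ordinal ratings with many tied observations.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 1, 2, 2, 3]

print(pair_counts(x, y))      # (7, 1, 7)
print(round(gamma(x, y), 4))  # 0.75
```

Because tied pairs are removed from the denominator, Gamma (0.75 here) is larger than the untied-pair version of tau for the same data, which would be (7 - 1) / 15 = 0.4.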

Uncertainty Coefficients: These are indices of 
stochastic dependence; the concept of stochastic dependence 
is derived from the information theory approach to the 
analysis of frequency tables and the user should refer to 
the appropriate references. S(Y,X) refers to symmetrical
dependence; S(X|Y) and S(Y|X) refer to asymmetrical
dependence.

Multiple Responses/Dichotomies: Multiple response 
variables or multiple dichotomies often arise when 
summarizing survey data. The nature of such variables or 
factors in a table is best illustrated with examples.

Multiple Response Variables: As part of a larger
market survey, suppose you asked a sample of consumers 
to name their three favorite soft drinks. The specific item 
on the questionnaire may look like this: 

Write down your three favorite soft drinks: 

1:_ 2: 3:_ 

Thus, the questionnaires returned to you will contain 
somewhere between 0 and 3 answers to this item. Also, a 
wide variety of soft drinks will most likely be named. Your 
goal is to summarize the responses to this item; that is, to 
produce a table that summarizes the percent of respondents 
who mentioned a respective soft drink. 

The next question is how to enter the responses into a 
data file. Suppose 50 different soft drinks were mentioned 
among all of the questionnaires. You could of course set up 
50 variables - one for each soft drink—and then enter a 1 
for the respective respondent and variable (soft drink), if he 
or she mentioned the respective soft drink (and a 0 if not); 
for example: 


          COKE    PEPSI    SPRITE    . . .
case 1      0       1        0
case 2      1       1        0
case 3      0       0        1
. . .     . . .   . . .    . . .


This method of coding the responses would be very 
tedious and “wasteful.” Note that each respondent can only 
give a maximum of three responses; yet we use 50 variables 
to code those responses. (However, if we are only interested
in these three soft drinks, then this method of coding just 
those three variables would be satisfactory; to tabulate soft 
drink preferences, we could then treat the three variables 
as a multiple dichotomy.) 

Coding multiple response variables: Alternatively,
we could set up three variables, and a coding scheme for
the 50 soft drinks. Then we could enter the respective codes
(or alpha labels) into the three variables, in the same way
that respondents wrote them down in the questionnaire.



          Resp. 1     Resp. 2     Resp. 3
case 1    COKE        PEPSI       JOLT
case 2    SPRITE      SNAPPLE     DR. PEPPER
case 3    PERRIER     GATORADE    MOUNTAIN DEW
. . .     . . .       . . .       . . .



To produce a table of the number of respondents by soft
drink we would now treat Resp. 1 to Resp. 3 as a multiple
response variable. That table could look like this:


N=500
                                        Prcnt. of    Prcnt. of
Category                     Count      Responses    Cases
COKE: Coca Cola                 44         5.23         8.80
PEPSI: Pepsi Cola               43         5.11         8.60
MOUNTAIN: Mountain Dew          81         9.62        16.20
PEPPER: Doctor Pepper           74         8.79        14.80
. . .                        . . .        . . .        . . .
                               842       100.00       168.40


Interpreting the multiple response frequency
table: The total number of respondents was N=500. Note
that the counts in the first column of the table do not add
up to 500, but rather to 842. That is the total number of
responses; since each respondent could make up to 3
responses (write down three names of soft drinks), the total
number of responses is naturally greater than the number
of respondents. For example, referring back to the sample
listing of the data file shown above, the first case (Coke,
Pepsi, Jolt) “contributes” three times to the frequency table,
once to the category Coke, once to the category Pepsi, and
once to the category Jolt. The second and third columns in
the table above report the percentages relative to the number
of responses (second column) as well as respondents (third
column). Thus, the entry 8.80 in the first row and last
column in the table above means that 8.8% of all respondents
mentioned Coke either as their first, second, or third soft
drink preference.
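The way such a table is built can be sketched on a tiny data file. The four cases below are hypothetical (they are not the 500 respondents summarized above), and `None` stands for a response slot left blank.

```python
from collections import Counter

# Hypothetical answers to "name your three favorite soft drinks";
# None marks a blank left on the questionnaire.
responses = [
    ("COKE", "PEPSI", "JOLT"),
    ("SPRITE", "COKE", None),
    ("PEPSI", "COKE", "SPRITE"),
    ("COKE", None, None),
]

# Each non-blank entry contributes once to its category.
counts = Counter(drink for case in responses for drink in case if drink is not None)
n_cases = len(responses)              # number of respondents
n_responses = sum(counts.values())    # number of responses

# (count, percent of responses, percent of cases) for each category
table = {
    drink: (count, 100 * count / n_responses, 100 * count / n_cases)
    for drink, count in counts.items()
}

print(f"{'Category':<10}{'Count':>6}{'% Resp.':>10}{'% Cases':>10}")
for drink, (count, pct_resp, pct_cases) in sorted(table.items()):
    print(f"{drink:<10}{count:>6}{pct_resp:>10.2f}{pct_cases:>10.2f}")
total_pct_cases = 100 * n_responses / n_cases
print(f"{'Total':<10}{n_responses:>6}{100.0:>10.2f}{total_pct_cases:>10.2f}")
```

As in the table above, the percent-of-responses column sums to 100, while the percent-of-cases column sums to more than 100 because one respondent can contribute to several categories.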


Multiple Dichotomies: Suppose in the above example
we were only interested in Coke, Pepsi, and Sprite. As
pointed out earlier, one way to code the data in that case
would be as follows:

          COKE    PEPSI    SPRITE    . . .
case 1              1
case 2      1       1
case 3                       1
. . .     . . .   . . .    . . .     . . .



In other words, one variable was created for each soft 
drink, then a value of 1 was entered into the respective 
variable whenever the respective drink was mentioned by 
the respective respondent. Note that each variable represents
a dichotomy; that is, only “1”s and “not 1”s are allowed (we
could have entered 1’s and 0’s, but to save typing we can
also simply leave the 0’s blank or missing). When tabulating
these variables, we would like to obtain a summary table
very similar to the one shown earlier for multiple response
variables; that is, we would like to compute the number
and percent of respondents (and responses) for each soft
drink. In a sense, we “compact” the three variables Coke,
Pepsi, and Sprite into a single variable (Soft Drink)
consisting of multiple dichotomies.

Crosstabulation of Multiple Responses/Dichotomies:
All of these types of variables can then be used in
crosstabulation tables. For example, we could crosstabulate
a multiple dichotomy for Soft Drink (coded as described in
the previous paragraph) with a multiple response variable
Favorite Fast Foods (with many categories such as
Hamburgers, Pizza, etc.), by the simple categorical variable
Gender. As in the frequency table, the percentages and
marginal totals in that table can be computed from the
total number of respondents as well as the total number of
responses. For example, consider the following hypothetical
respondent:


Gender    Coke    Pepsi    Sprite    Food1    Food2
FEMALE      1       1                FISH     PIZZA


This female respondent mentioned Coke and Pepsi as her 
favorite drinks, and Fish and Pizza as her favorite fast foods. 
In the complete crosstabulation table she will be counted in 
the following cells of the table: 


                        Food                           TOTAL No.
Gender    Drink     HAMBURG    FISH    PIZZA   . . .   of RESP.
FEMALE    COKE                  X        X                2
          PEPSI                 X        X                2
          SPRITE
MALE      COKE
          PEPSI
          SPRITE







This female respondent will “contribute” to (i.e., be 
counted in) the crosstabulation table a total of 4 times. In 
addition, she will be counted twice in the Female—Coke 
marginal frequency column if that column is requested to 
represent the total number of responses; if the marginal 
totals are computed as the total number of respondents, 
then this respondent will only be counted once. 
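The counting rule for this respondent can be traced with a short sketch. The record layout (a dictionary with `gender`, `drinks`, and `foods` keys) is an assumption chosen for illustration; the values are those of the hypothetical respondent above.

```python
from collections import Counter
from itertools import product

# The hypothetical respondent from the text: a multiple dichotomy for drinks
# (1 = mentioned, None = blank) and a multiple response variable for foods.
respondents = [
    {
        "gender": "FEMALE",
        "drinks": {"COKE": 1, "PEPSI": 1, "SPRITE": None},
        "foods": ["FISH", "PIZZA"],
    },
]

cells = Counter()               # (gender, drink, food) cell counts
margin_responses = Counter()    # row totals counted as responses
margin_respondents = Counter()  # row totals counted as respondents

for person in respondents:
    chosen = [drink for drink, flag in person["drinks"].items() if flag == 1]
    for drink, food in product(chosen, person["foods"]):
        cells[(person["gender"], drink, food)] += 1
    for drink in chosen:
        row = (person["gender"], drink)
        margin_responses[row] += len(person["foods"])
        margin_respondents[row] += 1 if person["foods"] else 0

print(sum(cells.values()))                     # 4 cells receive a count
print(margin_responses[("FEMALE", "COKE")])    # 2 when margins count responses
print(margin_respondents[("FEMALE", "COKE")])  # 1 when margins count respondents
```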

Paired Crosstabulation of Multiple Response 
Variables: A unique option for tabulating multiple response 
variables is to treat the variables in two or more multiple 
response variables as matched pairs. Again, this method is 
best illustrated with a simple example. Suppose we
conducted a survey of past and present home ownership.















We asked the respondents to describe their last three 
(including the present) homes that they purchased. 
Naturally, for some respondents the present home is the 
first and only home; others have owned more than one home 
in the past. For each home we asked our respondents to 
write down the number of rooms in the respective house, 
and the number of occupants. Here is how the data for one 
respondent (say case number 112) may be entered into a 
data file: 



       Rooms   1   2   3     No. Occ.   1   2   3
112            3   3   4                2   3   5


This respondent owned three homes; the first had 3 
rooms, the second also had 3 rooms, and the third had 4 
rooms. The family apparently also grew; there were 2 
occupants in the first home, 3 in the second, and 5 in the 
third. 

Now suppose we wanted to crosstabulate the number of 
rooms by the number of occupants for all respondents. One 
way to do so is to prepare three different two-way tables; 
one for each home. We can also treat the two factors in this 
study (Number of Rooms, Number of Occupants) as multiple 
response variables. However, it would obviously not make 
any sense to count the example respondent 112 shown above 
in cell 3 Rooms - 5 Occupants of the crosstabulation table 
(which we would, if we simply treated the two factors as 
ordinary multiple response variables). In other words, we 
want to ignore the combination of occupants in the third 
home with the number of rooms in the first home. Rather, 
we would like to count these variables in pairs ; we would 
like to consider the number of rooms in the first home 
together with the number of occupants in the first home, 
the number of rooms in the second home with the number 
of occupants in the second home, and so on. This is exactly 
what will be accomplished if we asked for a paired 
crosstabulation of these multiple response variables.

A Final Comment: When preparing complex
crosstabulation tables with multiple responses/dichotomies,
it is sometimes difficult (in our experience) to “keep track” 
of exactly how the cases in the file are counted. The best 
way to verify that one understands the way in which the 
respective tables are constructed is to crosstabulate some 
simple example data, and then to trace how each case is 
counted. The example section of the Crosstabulation chapter 
in the manual employs this method to illustrate how data 
are counted for tables involving multiple response variables 
and multiple dichotomies. 
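In the spirit of this advice, the difference between paired and ordinary (unpaired) counting in the home-ownership example can be traced on a tiny data file. Case 112 is taken from the text; case 113 and the dictionary layout are hypothetical additions for illustration.

```python
from collections import Counter
from itertools import product

homes = {
    112: {"rooms": [3, 3, 4], "occupants": [2, 3, 5]},              # from the text
    113: {"rooms": [5, None, None], "occupants": [4, None, None]},  # hypothetical
}

paired = Counter()     # home 1 with home 1, home 2 with home 2, ...
unpaired = Counter()   # every room count with every occupant count

for record in homes.values():
    for rooms, occupants in zip(record["rooms"], record["occupants"]):
        if rooms is not None and occupants is not None:
            paired[(rooms, occupants)] += 1
    all_rooms = [r for r in record["rooms"] if r is not None]
    all_occupants = [o for o in record["occupants"] if o is not None]
    for rooms, occupants in product(all_rooms, all_occupants):
        unpaired[(rooms, occupants)] += 1

print(paired[(3, 5)])    # 0: no home had 3 rooms and 5 occupants
print(unpaired[(3, 5)])  # 2: the meaningless combination is counted
print(sorted(paired.items()))
```

Paired counting produces one count per home actually owned (four in total here), whereas unpaired counting mixes the rooms of one home with the occupants of another.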



Nonparametric Statistics 


General Purpose 


Brief review of the idea of significance testing: To 
understand the idea of nonparametric statistics first requires 
a basic understanding of parametric statistics. The 
Elementary Concepts chapter of the manual introduces the 
concept of statistical significance testing based on the 
sampling distribution of a particular statistic (you may want 
to review that chapter before reading on). In short, if we 
have a basic knowledge of the underlying distribution of a 
variable, then we can make predictions about how, in 
repeated samples of equal size, this particular statistic will 
“behave,” that is, how it is distributed. 

For example, if we draw 100 random samples of 100 
adults each from the general population, and compute the 
mean height in each sample, then the distribution of the 
standardized means across samples will likely approximate 
the normal distribution (to be precise, Student’s t distribution 
with 99 degrees of freedom; see below). Now imagine that 
we take an additional sample in a particular city (“Tallburg”) 
where we suspect that people are taller than the average
population. If the mean height in that sample falls beyond
the 95th percentile of the t distribution, then we conclude
that, indeed, the people of Tallburg are taller than the
average population.


Are most variables normally distributed? In the above 
example we relied on our knowledge that, in repeated 
samples of equal size, the standardized means (for height) 
will be distributed following the t distribution (with a 
particular mean and variance). However, this will only be 
true if in the population the variable of interest (height in 
our example) is normally distributed, that is, if the 
distribution of people of particular heights follows the normal 
distribution (the bell-shape distribution). 



For many variables of interest, we simply do not know 
for sure that this is the case. For example, is income 
distributed normally in the population?—Probably not. The 
incidence rates of rare diseases are not normally distributed 
in the population, the number of car accidents is also not 
normally distributed, and neither are very many other 
variables in which a researcher might be interested. 

For more information on the normal distribution, see 
Elementary Concepts; for information on tests of normality.

Sample size: Another factor that often limits the
applicability of tests based on the assumption that the 
sampling distribution is normal is the size of the sample of 
data available for the analysis ( sample size; n). We can 
assume that the sampling distribution is normal even if we 
are not sure that the distribution of the variable in the 
population is normal, as long as our sample is large enough 
(e.g., 100 or more observations). However, if our sample is
very small, then those tests can be used only if we are sure 
that the variable is normally distributed, and there is no 
way to test this assumption if the sample is small. 


Problems in Measurement 


Applications of tests that are based on the normality 
assumptions are further limited by a lack of precise 
measurement. For example, let us consider a study where 
grade point average ( GPA ) is measured as the major variable 
of interest. Is an A average twice as good as a C average? Is 
the difference between a B and an A average comparable to 
the difference between a D and a C average? Somehow, the 
GPA is a crude measure of scholastic accomplishments that 
only allows us to establish a rank ordering of students from 
“good” students to “poor” students. This general measurement
issue is usually discussed in statistics textbooks in
terms of types of measurement or scale of measurement.
Without going into too much detail, most common statistical
techniques such as analysis of variance (and t tests),
regression, etc. assume that the underlying measurements
are at least of interval scale, meaning that equally spaced
intervals on the scale can be compared in a meaningful
manner (e.g., B minus A is equal to D minus C). However, as
in our example, this assumption is very often not tenable,
and the data represent a rank ordering of observations
(ordinal) rather than precise measurements.


Parametric and Nonparametric Methods 

Hopefully, after this somewhat lengthy introduction, the 
need is evident for statistical procedures that allow us to
process data of “low quality,” from small samples, on
variables about which nothing is known (concerning their
distribution). Specifically, nonparametric methods were
developed to be used in cases when the researcher knows
nothing about the parameters of the variable of interest in
the population (hence the name nonparametric). In more
technical terms, nonparametric methods do not rely on the 
estimation of parameters (such as the mean or the standard 
deviation) describing the distribution of the variable of 
interest in the population. Therefore, these methods are 
also sometimes (and more appropriately) called parameter- 
free methods or distribution-free methods. 


Brief Overview of Nonparametric Methods 

Basically, there is at least one nonparametric equivalent 
for each parametric general type of test. In general, these 
tests fall into the following categories: 

• Tests of differences between groups (independent 
samples); 

• Tests of differences between variables (dependent
samples);

• Tests of relationships between variables. 

Differences between independent groups: Usually,
when we have two samples that we want to compare
concerning their mean value for some variable of interest,
we would use the t-test for independent samples (in Basic
Statistics); nonparametric alternatives for this test are the
Wald-Wolfowitz runs test, the Mann-Whitney U test, and
the Kolmogorov-Smirnov two-sample test. If we have multiple
groups, we would use analysis of variance; the nonparametric
equivalents to this method are the Kruskal-Wallis analysis of
ranks and the Median test.

Differences between dependent groups: If we want 
to compare two variables measured in the same sample we 
would customarily use the t-test for dependent samples (in
Basic Statistics; for example, if we wanted to compare
students’ math skills at the beginning of the semester with
their skills at the end of the semester). Nonparametric
alternatives to this test are the Sign test and Wilcoxon’s
matched pairs test. If the variables of interest are
dichotomous in nature (i.e., “pass” vs. “no pass”) then
McNemar’s Chi-square test is appropriate. If there are more 
than two variables that were measured in the same sample, 
then we would customarily use repeated measures ANOVA. 
Nonparametric alternatives to this method are Friedman’s 
two-way analysis of variance and Cochran Q test (if the 
variable was measured in terms of categories, e.g., “passed” 
vs. “failed”). Cochran Q is particularly useful for measuring 
changes in frequencies (proportions) across time. 
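As an illustration of the simplest of these procedures, the Sign test can be sketched as an exact binomial computation: under the null hypothesis, each nonzero difference is equally likely to be positive or negative. The scores below are hypothetical, and `sign_test` is an illustrative name rather than a library function.

```python
from math import comb


def sign_test(before, after):
    """Exact two-sided Sign test for matched pairs (zero differences dropped)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    positives = sum(d > 0 for d in diffs)
    k = min(positives, n - positives)
    # P(X <= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return positives, n, min(1.0, 2 * tail)


# Hypothetical math scores for ten students at the start and end of a semester.
before = [72, 65, 80, 58, 90, 77, 68, 74, 61, 85]
after = [78, 70, 79, 66, 95, 82, 75, 74, 70, 88]

positives, n, p_value = sign_test(before, after)
print(positives, n, p_value)   # 8 positive signs out of 9 nonzero differences
```

Only the direction of each change is used; the Wilcoxon matched pairs test would additionally use the rank order of the magnitudes of the differences.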

Relationships between variables: To express a 
relationship between two variables one usually computes 
the correlation coefficient. Nonparametric equivalents to the 
standard correlation coefficient are Spearman R, Kendall 
Tau, and coefficient Gamma. If the two variables of interest 
are categorical in nature (e.g., “passed” vs. “failed” by “male”
vs. “female”) appropriate nonparametric statistics for testing 
the relationship between the two variables are the Chi- 
square test, the Phi coefficient, and the Fisher exact test. In 
addition, a simultaneous test for relationships between 
multiple cases is available: Kendall coefficient of concordance. 
This test is often used for expressing inter-rater agreement 
among independent judges who are rating (ranking) the 
same stimuli. 

Descriptive statistics: When one’s data are not 
normally distributed, and the measurements at best contain 
rank order information, then computing the standard 
descriptive statistics (e.g., mean, standard deviation) is
sometimes not the most informative way to summarize the
data. For example, in the area of psychometrics it is well
known that the rated intensity of a stimulus (e.g., perceived
brightness of a light) is often a logarithmic function of the
actual intensity of the stimulus (brightness as measured in
objective units of Lux). In this example, the simple mean
rating (sum of ratings divided by the number of stimuli) is
not an adequate summary of the average actual intensity of
the stimuli. (In this example, one would probably rather
compute the geometric mean.) Nonparametrics and
Distributions will compute a wide variety of measures of
location (mean, median, mode, etc.) and dispersion (variance,
average deviation, quartile range, etc.) to provide the
“complete picture” of one’s data.
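A small sketch shows how different measures of location can disagree when values span several orders of magnitude, which is the situation where the geometric mean is often preferred. The intensities are hypothetical; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean, mean, median

# Hypothetical stimulus intensities spanning several orders of magnitude.
intensities = [1, 10, 100, 1000]

arithmetic = mean(intensities)         # dominated by the largest value
geometric = geometric_mean(intensities)  # exp of the mean of the logarithms
middle = median(intensities)

print(arithmetic)             # 277.75
print(round(geometric, 2))    # 31.62
print(middle)                 # 55.0
```

The geometric mean corresponds to averaging on the logarithmic scale and transforming back, so it is not pulled toward the single largest intensity in the way the arithmetic mean is.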


When to Use Which Method 

It is not easy to give simple advice concerning the use of 
nonparametric procedures. Each nonparametric procedure 
has its peculiar sensitivities and blind spots. For example,
the Kolmogorov-Smirnov two-sample test is not only 
sensitive to differences in the location of distributions (for 
example, differences in means) but is also greatly affected 
by differences in their shapes. The Wilcoxon matched pairs 
test assumes that one can rank order the magnitude of 
differences in matched observations in a meaningful manner. 
If this is not the case, one should rather use the Sign test. 
In general, if the result of a study is important (e.g., does a 
very expensive and painful drug therapy help people get 
better?), then it is always advisable to run different
nonparametric tests; should discrepancies in the results
occur contingent upon which test is used, one should try to 
understand why some tests give different results. On the 
other hand, nonparametric statistics are less statistically 
powerful (sensitive) than their parametric counterparts, and 
if it is important to detect even small effects (e.g., is this 
food additive harmful to people?) one should be very careful 
in the choice of a test statistic. 


Large data sets and nonparametric methods: 
Nonparametric methods are most appropriate when the 
sample sizes are small. When the data set is large (e.g., n > 
100) it often makes little sense to use nonparametric
statistics at all. The Elementary Concepts chapter of the
manual briefly discusses the idea of the central limit
theorem. In a nutshell, when the samples become very large,
then the sample means will follow the normal distribution
even if the respective variable is not normally distributed
in the population, or is not measured very well. Thus,
parametric methods, which are usually much more sensitive
(i.e., have more statistical power), are in most cases
appropriate for large samples. However, the tests of
significance of many of the nonparametric statistics described 
here are based on asymptotic (large sample) theory; 
therefore, meaningful tests can often not be performed if 
the sample sizes become too small. Please refer to the 
descriptions of the specific tests to learn more about their 
power and efficiency. 

Nonparametric Correlations 

The following are three types of commonly used 
nonparametric correlation coefficients. Note that the chi- 
square statistic computed for two-way frequency tables also 
provides a careful measure of a relation between the two 
(tabulated) variables, and unlike the correlation measures 
listed below, it can be used for variables that are measured 
on a simple nominal scale. 

Spearman R. Spearman R (Siegel & Castellan, 1988) 
assumes that the variables under consideration were 
measured on at least an ordinal (rank order) scale, that is, 
that the individual observations can be ranked into two 
ordered series. Spearman R can be thought of as the regular 
Pearson product moment correlation coefficient, that is, in 
terms of proportion of variability accounted for, except that 
Spearman R is computed from ranks. 

Kendall tau. Kendall tau is equivalent to Spearman R 
with regard to the underlying assumptions. It is also 
comparable in terms of its statistical power. However, 
Spearman R and Kendall tau are usually not identical in
magnitude because their underlying logic as well as their
computational formulas are very different. Siegel and
Castellan (1988) express the relationship of the two
measures in terms of the inequality:

-1 <= 3 * Kendall tau - 2 * Spearman R <= 1

More importantly, Kendall tau and Spearman R imply 
different interpretations: Spearman R can be thought of as 
the regular Pearson product moment correlation coefficient, 
that is, in terms of proportion of variability accounted for, 
except that Spearman R is computed from ranks. Kendall 
tau, on the other hand, represents a probability, that is, it 
is the difference between the probabilities that in the 
observed data the two variables are in the same order versus 
the probability that the two variables are in different orders. 

Gamma: The Gamma statistic (Siegel & Castellan, 1988) 
is preferable to Spearman R or Kendall tau when the data 
contain many tied observations. In terms of the underlying 
assumptions, Gamma is equivalent to Spearman R or 
Kendall tau; in terms of its interpretation and computation 
it is more similar to Kendall tau than Spearman R. In 
short, Gamma is also a probability; specifically, it is 
computed as the difference between the probability that the 
rank ordering of the two variables agree minus the 
probability that they disagree, divided by 1 minus the 
probability of ties. Thus, Gamma is basically equivalent to 
Kendall tau, except that ties are explicitly taken into account. 



ANOVA/MANOVA 


BASIC IDEAS 

The Purpose of Analysis of Variance 

In general, the purpose of analysis of variance (ANOVA) is 
to test for significant differences between means. Elementary 
Concepts provides a brief introduction into the basics of 
statistical significance testing. If we are only comparing 
two means, then ANOVA will give the same results as the t 
test for independent samples (if we are comparing two 
different groups of cases or observations), or the t test for 
dependent samples (if we are comparing two variables in 
one set of cases or observations). 

Why the name analysis of variance? It may seem 
odd to you that a procedure that compares means is called 
analysis of variance. However, this name is derived from 
the fact that in order to test for statistical significance 
between means, we are actually comparing (i.e., analyzing)
variances. 

• The Partitioning of Sums of Squares 

• Multi-Factor ANOVA 

• Interaction Effects

• Complex Designs

• Analysis of Covariance (ANCOVA) 

• Multivariate Designs: MANOVA/MANCOVA 

• Contrast Analysis and Post hoc Tests 

• Assumptions and Effects of Violating Assumptions 

The Partitioning of Sums of Squares 

At the heart of ANOVA is the fact that variances can be 
divided up, that is, partitioned. Remember that the variance 
is computed as the sum of squared deviations from the 
overall mean, divided by n-1 (sample size minus one). Thus, 
given a certain n, the variance is a function of the sums of 
(deviation) squares, or SS for short. Partitioning of variance 
works as follows. Consider the following data set: 



                         Group 1    Group 2
Observation 1               2          6
Observation 2               3          7
Observation 3               1          5
Mean                        2          6
Sums of Squares (SS)        2          2
Overall Mean                      4
Total Sums of Squares            28



The means for the two groups are quite different (2 and 
6, respectively). The sums of squares within each group are 
equal to 2. Adding them together, we get 4. If we now 
repeat these computations, ignoring group membership, that
is, if we compute the total SS based on the overall mean, we
get the number 28. In other words, computing the variance
(sums of squares) based on the within-group variability
yields a much smaller estimate of variance than computing
it based on the total variability (the overall mean). The
reason for this in the above example is of course that there
is a large difference between means, and it is this difference
that accounts for the difference in the SS. In fact, if we
were to perform an ANOVA on the above data, we would 
get the following result:

            MAIN EFFECT
          SS      df    MS      F       p
Effect    24.0    1     24.0    24.0    .008
Error      4.0    4      1.0




As you can see, in the above table the total SS (28) was
partitioned into the SS due to within-group variability
(2+2=4) and variability due to differences between means
(28-(2+2)=24).
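The partition can be verified directly from the six observations. The sketch below uses only the data given in the example; the helper names `mean` and `sum_of_squares` are illustrative, and the p value is not computed because that would require the F distribution.

```python
group1 = [2, 3, 1]
group2 = [6, 7, 5]


def mean(values):
    return sum(values) / len(values)


def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)


everything = group1 + group2

ss_within = (sum_of_squares(group1, mean(group1))
             + sum_of_squares(group2, mean(group2)))    # 2 + 2
ss_total = sum_of_squares(everything, mean(everything))  # around the overall mean
ss_effect = ss_total - ss_within                          # between-groups part

df_effect = 2 - 1                   # number of groups minus one
df_error = len(everything) - 2      # observations minus number of groups

ms_effect = ss_effect / df_effect
ms_error = ss_within / df_error
f_ratio = ms_effect / ms_error

print(ss_total, ss_within, ss_effect)   # 28.0 4.0 24.0
print(ms_effect, ms_error, f_ratio)     # 24.0 1.0 24.0
```

These values reproduce the SS, df, MS, and F columns of the table above.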

SS Error and SS Effect: The within-group variability 
(SS) is usually referred to as Error variance. This term 
denotes the fact that we cannot readily explain or account 
for it in the current design. However, the SS Effect we can 
explain. Namely, it is due to the differences in means 
between the groups. Put another way, group membership 
explains this variability because we know that it is due to 
the differences in means. 

Significance testing: The basic idea of statistical 
significance testing is discussed in Elementary Concepts. 
Elementary Concepts also explains why very many statistical 
tests represent ratios of explained to unexplained variability. 
ANOVA is a good example of this. Here, we base this test 
on a comparison of the variance due to the between-groups
variability (called Mean Square Effect, or MS effect) with the
within-group variability (called Mean Square Error, or
MS error; this term was first used by Edgeworth, 1885). Under
the null hypothesis (that there are no mean differences
between groups in the population), we would still expect
some minor random fluctuation in the means for the two
groups when taking small samples (as in our example).
Therefore, under the null hypothesis, the variance estimated
based on within-group variability should be about the same
as the variance due to between-groups variability. We can
compare those two estimates of variance via the F test,
which tests whether the ratio of the two variance estimates
is significantly greater than 1. In our example above, that
test is highly significant, and we would in fact conclude
that the means for the two groups are significantly different
from each other.


Summary of the basic logic of ANOVA: To 
summarize the discussion up to this point, the purpose of 
analysis of variance is to test differences in means (for groups 
or variables) for statistical significance. This is accomplished 
by analyzing the variance, that is, by partitioning the total 
variance into the component that is due to true random 
error (i.e., within-group SS) and the components that are
due to differences between means. These latter variance 
components are then tested for statistical significance, and, 
if significant, we reject the null hypothesis of no differences 
between means, and accept the alternative hypothesis that 
the means (in the population) are different from each other. 

Dependent and independent variables: The
variables that are measured (e.g., a test score) are called
dependent variables. The variables that are manipulated or
controlled (e.g., a teaching method or some other criterion
used to divide observations into groups that are compared) 
are called factors or independent variables. 

Multi-Factor ANOVA 

In the simple example above, it may have occurred to 
you that we could have simply computed a t test for 
independent samples to arrive at the same conclusion. And, 
indeed, we would get the identical result if we were to 
compare the two groups using this test. However, ANOVA 
is a much more flexible and powerful technique that can be 
applied to much more complex research issues. 

Multiple factors: The world is complex and multivariate 
in nature, and instances when a single variable completely 
explains a phenomenon are rare. For example, when trying 
to explore how to grow a bigger tomato, we would need to
consider factors that have to do with the plants’ genetic
makeup, soil conditions, lighting, temperature, etc. Thus,
in a typical experiment, many factors are taken into account.
One important reason for using ANOVA methods rather
than multiple two-group studies analyzed via t tests is that
the former method is more efficient, and with fewer
observations we can gain more information. Let us expand 
on this statement. 

Controlling for factors: Suppose that in the above 
two-group example we introduce another grouping factor, 
for example, Gender. Imagine that in each group we have 3 
males and 3 females. We could summarize this design in a 
2 by 2 table: 



            Experimental    Experimental
            Group 1         Group 2
Males           2               6
                3               7
                1               5
Mean            2               6
Females         4               8
                5               9
                3               7
Mean            4               8


Before performing any computations, it appears that we 
can partition the total variance into at least 3 sources: (1) 
error (within-group) variability, (2) variability due to 
experimental group membership, and (3) variability due to 
gender. (Note that there is an additional source— 
interaction —that we will discuss shortly.) What would have 
happened had we not included gender as a factor in the 
study but rather computed a simple t test? If you compute 
the SS ignoring the gender factor (use the within-group 
means, ignoring or collapsing across gender; the result is 
SS = 10 + 10 = 20), you will see that the resulting within-group 
SS is larger than it is when we include gender (use the 
within-group, within-gender means to compute those SS; 
they will be equal to 2 in each cell, thus the combined 
SS-within is equal to 2 + 2 + 2 + 2 = 8). This difference is due to 
the fact that the means for males are systematically lower 
than those for females, and this difference in means adds 
variability if we ignore this factor. Controlling for error 
variance increases the sensitivity (power) of a test. This 
example demonstrates another principle of ANOVA that 
makes it preferable over simple two-group t test studies: In 
ANOVA we can test each factor while controlling for all 
others; this is actually the reason why ANOVA is more 
statistically powerful (i.e., we need fewer observations to 
find a significant effect) than the simple t test. 
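
The two within-group SS values quoted above can be checked directly from the scores in the 2 by 2 table. The following is a minimal sketch in Python; the helper name ss_within and the dictionary layout are our own choices for illustration, not part of any particular statistics package.

```python
def ss_within(values):
    """Sum of squared deviations of the values from their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Scores from the 2 by 2 table: experimental group x gender
cells = {
    ("group1", "male"): [2, 3, 1],
    ("group1", "female"): [4, 5, 3],
    ("group2", "male"): [6, 7, 5],
    ("group2", "female"): [8, 9, 7],
}

# Ignoring gender: pool males and females within each experimental group
ss_ignoring_gender = sum(
    ss_within([v for (g, _), vals in cells.items() if g == group for v in vals])
    for group in ("group1", "group2")
)

# Including gender: deviations are taken from each cell's own mean
ss_with_gender = sum(ss_within(vals) for vals in cells.values())

print(ss_ignoring_gender, ss_with_gender)  # 20.0 8.0
```

The drop from 20 to 8 is the variability that gender accounts for; it no longer inflates the error term.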


Interaction Effects 

There is another advantage of ANOVA over simple t 
tests: ANOVA allows us to detect interaction effects between 
variables and, therefore, to test more complex hypotheses 
about reality. Let us consider another example to illustrate 
this point. 

Main effects, two-way interaction. Imagine that we have 
a sample of highly achievement-oriented students and 
another of achievement “avoiders.” We now create two 
random halves in each sample, and give one half of each 
sample a challenging test, the other an easy test. We 
measure how hard the students work on the test. The means 
of this (fictitious) study are as follows: 



                     Achievement-    Achievement-
                     oriented        avoiders

Challenging Test     10              5
Easy Test            5               10


How can we summarize these results? Is it 
appropriate to conclude that (1) challenging tests make 
students work harder, or (2) achievement-oriented students 
work harder than achievement-avoiders? Neither of these 
statements captures the essence of this clearly systematic 
pattern of means. The appropriate way to summarize the 
result would be to say that challenging tests make only 
achievement-oriented students work harder, while easy tests 
make only achievement-avoiders work harder. In other 
words, the type of achievement orientation and test difficulty 
interact in their effect on effort; specifically, this is an 
example of a two-way interaction between achievement 
orientation and test difficulty. Note that statements 1 and 
2 above describe so-called main effects. 
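
To see why neither main effect describes these data, it helps to compute the marginal means and the interaction contrast from the four cell means. The sketch below uses the means of the fictitious study; the function and variable names are ours, chosen only for illustration.

```python
# Cell means of the fictitious study: effort by test type and orientation
cell_means = {
    ("challenging", "oriented"): 10,
    ("challenging", "avoider"): 5,
    ("easy", "oriented"): 5,
    ("easy", "avoider"): 10,
}

def marginal(level, position):
    """Average of the cell means that share one factor level."""
    vals = [m for key, m in cell_means.items() if key[position] == level]
    return sum(vals) / len(vals)

# Main effects: differences between marginal means
test_effect = marginal("challenging", 0) - marginal("easy", 0)
orientation_effect = marginal("oriented", 1) - marginal("avoider", 1)

# Interaction: does the effect of test difficulty depend on orientation?
effect_for_oriented = (cell_means[("challenging", "oriented")]
                       - cell_means[("easy", "oriented")])
effect_for_avoiders = (cell_means[("challenging", "avoider")]
                       - cell_means[("easy", "avoider")])
interaction = effect_for_oriented - effect_for_avoiders

print(test_effect, orientation_effect, interaction)  # 0.0 0.0 10
```

Both main effects are exactly zero, while the interaction contrast is large: the whole pattern lives in the interaction.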

Higher order interactions: While the previous two-way 
interaction can be put into words relatively easily, 
higher order interactions are increasingly difficult to 
verbalize. Imagine that we had included factor Gender in 
the achievement study above, and we had obtained the 
following pattern of means: 


Females              Achievement-    Achievement-
                     oriented        avoiders

Challenging Test     10              5
Easy Test            5               10

Males                Achievement-    Achievement-
                     oriented        avoiders

Challenging Test     1               6
Easy Test            6               1


How could we now summarize the results of our 
study? Graphs of means for all effects greatly facilitate the 
interpretation of complex effects. The pattern shown in the 
table above (and in the graph below) represents a three-way 
interaction between factors. 

Thus we may summarize this pattern by saying that for 
females there is a two-way interaction between 
achievement-orientation type and test difficulty: Achievement-oriented 
females work harder on challenging tests than on easy tests; 
achievement-avoiding females work harder on easy tests 
than on difficult tests. For males, this interaction is reversed. 
As you can see, the description of the interaction has become 
much more involved. 
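
The verbal description can be made concrete by computing the two-way interaction contrast separately for each gender and then comparing the two. This is only a sketch on the cell means given above; the names two_way_contrast and three_way are ours.

```python
# Cell means: (gender, test, orientation) -> mean effort
means_3way = {
    ("female", "challenging", "oriented"): 10,
    ("female", "challenging", "avoider"): 5,
    ("female", "easy", "oriented"): 5,
    ("female", "easy", "avoider"): 10,
    ("male", "challenging", "oriented"): 1,
    ("male", "challenging", "avoider"): 6,
    ("male", "easy", "oriented"): 6,
    ("male", "easy", "avoider"): 1,
}

def two_way_contrast(gender):
    """Test-difficulty effect for oriented students minus that for avoiders."""
    oriented = (means_3way[(gender, "challenging", "oriented")]
                - means_3way[(gender, "easy", "oriented")])
    avoider = (means_3way[(gender, "challenging", "avoider")]
               - means_3way[(gender, "easy", "avoider")])
    return oriented - avoider

female_contrast = two_way_contrast("female")
male_contrast = two_way_contrast("male")

# Three-way interaction: the two-way contrast differs between genders
three_way = female_contrast - male_contrast

print(female_contrast, male_contrast, three_way)  # 10 -10 20
```

The two-way contrast has the same size but opposite sign for males and females, which is exactly what "the interaction is reversed" means.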


A general way to express interactions: A general 
way to express all interactions is to say that an effect is 
modified (qualified) by another effect. Let us try this with 
the two-way interaction above. The main effect for test 
difficulty is modified by achievement orientation. For the 
three-way interaction in the previous paragraph, we may 
summarize that the two-way interaction between test 
difficulty and achievement orientation is modified (qualified) 
by gender. If we have a four-way interaction, we may say 
that the three-way interaction is modified by the fourth 
variable, that is, that there are different types of interactions 
in the different levels of the fourth variable. As it turns out, 
in many areas of research five- or higher-way interactions 
are not that uncommon. 

Complex Designs 

Let us review the basic “building blocks” of complex 
designs. 

Between-Groups and Repeated Measures: When we 
want to compare two groups, we would use the t test for 
independent samples; when we want to compare two 
variables given the same subjects (observations), we would 
use the t test for dependent samples. This distinction 
(dependent versus independent samples) is important for 
ANOVA as well. Basically, if we have repeated measurements 
of the same variable (under different conditions or at 
different points in time) on the same subjects, then the factor 
is a repeated measures factor (also called a within-subjects 
factor, because to estimate its significance we compute the 
within-subjects SS). If we compare different groups of 
subjects (e.g., males and females; three strains of bacteria, 
etc.) then we refer to the factor as a between-groups factor. 
The computations of significance tests are different for these 
different types of factors; however, the logic of computations 
and interpretations is the same. 

Between-within designs: In many instances, 
experiments call for the inclusion of between-groups and 
repeated measures factors. For example, we may measure 
math skills in male and female students (gender, a between- 
groups factor) at the beginning and the end of the semester. 
The two measurements on each student would constitute a 
within-subjects (repeated measures) factor. The 
interpretation of main effects and interactions is not affected 
by whether a factor is between-groups or repeated measures, 
and both factors may obviously interact with each other 
(e.g., females improve over the semester while males 
deteriorate). 

Incomplete (Nested) Designs 

There are instances where we may decide to ignore 
interaction effects. This happens when (1) we know that in 
the population the interaction effect is negligible, or (2) 
when a complete factorial design (this term was first 
introduced by Fisher, 1935a) cannot be used for economic 
reasons. Imagine a study where we want to evaluate the 
effect of four fuel additives on gas mileage. For our test, our 
company has provided us with four cars and four drivers. A 
complete factorial experiment, that is, one in which each 
combination of driver, additive, and car appears at least 
once, would require 4 x 4 x 4 = 64 individual test conditions 
(groups). However, we may not have the resources (time) to 
run all of these conditions; moreover, it seems unlikely that 
the type of driver would interact with the fuel additive to 
an extent that would be of practical relevance. Given these 
considerations, one could actually run a so-called Latin 
square design and “get away” with only 16 individual groups 
(the four additives are denoted by letters A, B, C, and D): 



             Car 1    Car 2    Car 3    Car 4

Driver 1     A        B        C        D
Driver 2     B        C        D        A
Driver 3     C        D        A        B
Driver 4     D        A        B        C


Latin square designs (this term was first used by Euler, 
1782) are described in most textbooks on experimental 
methods and we do not want to discuss here the details of 
how they are constructed. Suffice it to say that this design 
is incomplete insofar as not all combinations of factor levels 
occur in the design. For example, Driver 1 will only drive 
Car 1 with additive A, while Driver 3 will drive that car 
with additive C. In a sense, the levels of the additives factor 
(A, B, C, and D) are placed into the cells of the car by driver 
matrix like “eggs into a nest.” This mnemonic device is 
sometimes useful for remembering the nature of nested 
designs. 

Note that there are several other statistical procedures 
which may be used to analyze these types of designs; see 
the section on Methods for Analysis of Variance for details. 
In particular the methods discussed in the Variance 
Components and Mixed Model ANOVA/ANCOVA chapter 
are very efficient for analyzing designs with unbalanced 
nesting (when the nested factors have different numbers of 
levels within the levels of the factors in which they are 
nested), very large nested designs (e.g., with more than 200 
levels overall), or hierarchically nested designs (with or 
without random factors). 
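
Returning to the fuel additive example, the defining property of the Latin square table can be checked mechanically: every additive appears exactly once for each driver and once for each car. The sketch below builds the square by a simple cyclic shift, which is one convenient construction that happens to reproduce the table above; it is our assumption for illustration, not the only way (or necessarily the recommended way) to construct such a design.

```python
additives = ["A", "B", "C", "D"]
n = len(additives)

# Row i is the additive list shifted left by i positions (a cyclic square).
# Rows are Drivers 1-4 and columns are Cars 1-4, as in the table above.
square = [[additives[(driver + car) % n] for car in range(n)]
          for driver in range(n)]

def is_latin(sq):
    """True if each symbol occurs exactly once in every row and column."""
    symbols = set(sq[0])
    rows_ok = all(len(row) == len(symbols) and set(row) == symbols for row in sq)
    cols_ok = all(set(col) == symbols for col in zip(*sq))
    return rows_ok and cols_ok

for row in square:
    print(" ".join(row))

full_factorial_cells = n ** 3  # every driver x car x additive combination
latin_square_cells = n ** 2    # one additive per driver-car pair
print(is_latin(square), full_factorial_cells, latin_square_cells)
```

The last line contrasts the 64 conditions of the complete factorial design with the 16 conditions that the Latin square requires.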


ANALYSIS OF COVARIANCE (ANCOVA) 

General Idea 

The Basic Ideas section discussed briefly the idea of 
“controlling” for factors and how the inclusion of additional 
factors can reduce the error SS and increase the statistical 
power (sensitivity) of our design. This idea can be extended 
to continuous variables, and when such continuous variables 
are included as factors in the design they are called 
covariates. 

Fixed Covariates 

Suppose that we want to compare the math skills of 
students who were randomly assigned to one of two 
alternative textbooks. Imagine that we also have data about 
the general intelligence (IQ) for each student in the study. 
We would suspect that general intelligence is related to 
math skills, and we can use this information to make our 
test more sensitive. Specifically, imagine that in each one of 
the two groups we can compute the correlation coefficient 
between IQ and math skills. Remember that once we have 
computed the correlation coefficient we can estimate the 
amount of variance in math skills that is accounted for by 
IQ, and the amount of (residual) variance that we cannot 
explain with IQ. We may use this residual variance in the 
ANOVA as an estimate of the true error SS after controlling 
for IQ. If the correlation between IQ and math skills is 
substantial, then a large reduction in the error SS may be 
achieved. 

Effect of a covariate on the F test: In the F test, to 
evaluate the statistical significance of between-groups 
differences, we compute the ratio of the between-groups 
variance (MS_effect) over the error variance (MS_error). If MS_error 
becomes smaller, due to the explanatory power of IQ, then 
the overall F value will become larger. 
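
A small numerical sketch may make this concrete. The data below are hypothetical (invented purely for illustration), and both textbook groups are given the same IQ values so that the covariate is unrelated to group membership; under that assumption the effect SS needs no adjustment and only the error term changes. The error SS after controlling for IQ is computed with the usual pooled within-group formula, SS_error minus Sxy squared over Sxx.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cross(xs, ys):
    """Sum of cross-products of deviations from the respective means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

# Hypothetical data: (IQ values, math scores) for two textbook groups
groups = {
    "textbook_1": ([90, 100, 110, 120], [10, 12, 15, 17]),
    "textbook_2": ([90, 100, 110, 120], [13, 16, 17, 20]),
}

n_total = sum(len(iq) for iq, _ in groups.values())
k = len(groups)
grand_mean = mean([s for _, scores in groups.values() for s in scores])

# Between-groups (effect) SS and within-groups (error) SS for math scores
ss_effect = sum(len(sc) * (mean(sc) - grand_mean) ** 2 for _, sc in groups.values())
ss_error = sum(cross(sc, sc) for _, sc in groups.values())

# Pooled within-group sums of squares and cross-products for the covariate
sxx = sum(cross(iq, iq) for iq, _ in groups.values())
sxy = sum(cross(iq, sc) for iq, sc in groups.values())

# Error SS that remains after removing what IQ explains within groups
ss_error_adj = ss_error - sxy ** 2 / sxx

f_anova = (ss_effect / (k - 1)) / (ss_error / (n_total - k))
# One error degree of freedom is spent on estimating the covariate slope.
# ss_effect is reused unadjusted only because the groups have equal mean IQ.
f_ancova = (ss_effect / (k - 1)) / (ss_error_adj / (n_total - k - 1))

print(ss_error, round(ss_error_adj, 3), f_anova, round(f_ancova, 1))
```

In this constructed example the error SS shrinks from 54 to about 1.1, and the F value rises accordingly, even though one error degree of freedom is lost.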

Multiple covariates: The logic described above for the 
case of a single covariate (IQ) can easily be extended to the 
case of multiple covariates. For example, in addition to IQ, 
we might include measures of motivation, spatial reasoning, 
etc., and instead of a simple correlation, compute the 
multiple correlation coefficient. 


When the F value gets smaller. In some studies with 
covariates it happens that the F value actually becomes 
smaller (less significant) after including covariates in the 
design. This is usually an indication that the covariates are 
not only correlated with the dependent variable (e.g., math 
skills), but also with the between-groups factors (e.g., the 
two different textbooks). For example, imagine that we 
measured IQ at the end of the semester, after the students 
in the different experimental groups had used the respective 
textbook for almost one year. It is possible that, even though 
students were initially randomly assigned to one of the two 
textbooks, the different books were so different that both 
math skills and IQ improved differentially in the two groups. 
In that case, the covariate will not only partition variance 
away from the error variance, but also from the variance 
due to the between-groups factor. Put another way, after 
controlling for the differences in IQ that were produced by 
the two textbooks, the math skills are not that different. 
Put in yet a different way, by “eliminating” the effects of IQ, 
we have inadvertently eliminated the true effect of the 
textbooks on students’ math skills. 

Adjusted means: When the latter case happens, that 
is, when the covariate is affected by the between-groups 
factor, then it is appropriate to compute so-called adjusted 
means. These are the means that one would get after 
removing all differences that can be accounted for by the 
covariate. 

Interactions between covariates and factors: Just 
as we can test for interactions between factors, we can also 
test for the interactions between covariates and 
between-groups factors. Specifically, imagine that one of the textbooks 
is particularly suited for intelligent students, while the other 
actually bores those students but challenges the less 
intelligent ones. As a result, we may find a positive 
correlation in the first group (the more intelligent, the better 
the performance), but a zero or slightly negative correlation 
in the second group (the more intelligent the student, the 
less likely he or she is to acquire math skills from the 
particular textbook). In some older statistics textbooks this 
condition is discussed as a case where the assumptions for 
analysis of covariance are violated. However, because 
ANOVA/MANOVA uses a very general approach to analysis 
of covariance, you can specifically estimate the statistical 
significance of interactions between factors and covariates. 

Changing Covariates 

While fixed covariates are commonly discussed in 
textbooks on ANOVA, changing covariates are discussed 
less frequently. In general, when we have repeated 
measures, we are interested in testing the differences in 
repeated measurements on the same subjects. Thus we are 
actually interested in evaluating the significance of changes. 
If we have a covariate that is also measured at each point 
when the dependent variable is measured, then we can 
compute the correlation between the changes in the covariate 
and the changes in the dependent variable. For example, 
we could study math anxiety and math skills at the 
beginning and at the end of the semester. It would be 
interesting to see whether any changes in math anxiety 
over the semester correlate with changes in math skills. 

MULTIVARIATE DESIGNS: MANOVA/MANCOVA 

Between-Groups Designs 

All examples discussed so far have involved only one 
dependent variable. Even though the computations become 
increasingly complex, the logic and nature of the 
computations do not change when there is more than one 
dependent variable at a time. For example, we may conduct 
a study where we try two different textbooks, and we are 
interested in the students’ improvements in math and 
physics. In that case, we have two dependent variables, and 
our hypothesis is that both together are affected by the 
difference in textbooks. We could now perform a multivariate 
analysis of variance (MANOVA) to test this hypothesis. 
Instead of a univariate F value, we would obtain a 
multivariate F value (Wilks’ lambda) based on a comparison 
of the error variance/covariance matrix and the effect 
variance/covariance matrix. The “covariance” here is included 
because the two measures are probably correlated and we 
must take this correlation into account when performing 
the significance test. Obviously, if we were to take the same 
measure twice, then we would really not learn anything 
new. If we take a correlated measure, we gain some new 
information, but the new variable will also contain 
redundant information that is expressed in the covariance 
between the variables. 
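
As a rough illustration of how the error and effect variance/covariance matrices are compared, Wilks' lambda can be written as the determinant of the error matrix E divided by the determinant of the sum of E and the effect matrix H. The sketch below applies this to a tiny hypothetical data set (two textbooks, two dependent variables); the scores and helper names are invented for illustration, and a real analysis would go on to convert lambda to an F value.

```python
def mean_vec(rows):
    """Column means of a list of (math, physics) pairs."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def add_outer(matrix, d, weight=1.0):
    """Add weight * d d^T to a 2 x 2 matrix in place."""
    for i in range(2):
        for j in range(2):
            matrix[i][j] += weight * d[i] * d[j]

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical (math, physics) improvement scores for two textbooks
groups = [
    [(2, 3), (3, 5), (4, 4)],
    [(5, 6), (6, 8), (7, 7)],
]
grand = mean_vec([row for g in groups for row in g])

E = [[0.0, 0.0], [0.0, 0.0]]  # error (within-groups) SSCP matrix
H = [[0.0, 0.0], [0.0, 0.0]]  # effect (between-groups) SSCP matrix
for g in groups:
    centre = mean_vec(g)
    for row in g:
        add_outer(E, [row[j] - centre[j] for j in range(2)])
    add_outer(H, [centre[j] - grand[j] for j in range(2)], weight=len(g))

total = [[E[i][j] + H[i][j] for j in range(2)] for i in range(2)]
wilks_lambda = det2(E) / det2(total)

print(E, H, round(wilks_lambda, 4))
```

Lambda lies between 0 and 1; values near 0 indicate that most of the generalized variance is attributable to the effect, and the off-diagonal entries of E and H are the covariances that the text says must be taken into account.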

Interpreting results: If the overall multivariate test 
is significant, we conclude that the respective effect (e.g., 
textbook) is significant. However, our next question would 
of course be whether only math skills improved, only physics 
skills improved, or both. In fact, after obtaining a significant 
multivariate test for a particular main effect or interaction, 
customarily one would examine the univariate F tests for 
each variable to interpret the respective effect. In other 
words, one would identify the specific dependent variables 
that contributed to the significant overall effect. 

Repeated Measures Designs 

If we were to measure math and physics skills at the 
beginning of the semester and the end of the semester, we 
would have a multivariate repeated measure. Again, the 
logic of significance testing in such designs is simply an 
extension of the univariate case. Note that MANOVA 
methods are also commonly used to test the significance of 
univariate repeated measures factors with more than two 
levels; this application will be discussed later in this section. 

Sum Scores versus MANOVA 

Even experienced users of ANOVA and MANOVA 
techniques are often puzzled by the differences in results 
that sometimes occur when performing a MANOVA on, for 
example, three variables as compared to a univariate 
ANOVA on the sum of the three variables. The logic 
underlying the summing of variables is that each variable 
contains some “true” value of the variable in question, as 
well as some random measurement error. Therefore, by 
summing up variables, the measurement error will sum to 
approximately 0 across all measurements, and the sum score 
will become more and more reliable (increasingly equal to 
the sum of true scores). In fact, under these circumstances, 
ANOVA on sums is appropriate and represents a very 
sensitive (powerful) method. However, if the dependent 
variable is truly multidimensional in nature, then summing 
is inappropriate. For example, suppose that my dependent 
measure consists of four indicators of success in society, 
and each indicator represents a completely independent way 
in which a person could “make it” in life (e.g., successful 
professional, successful entrepreneur, successful homemaker, 
etc.). Now, summing up the scores on those variables would 
be like adding apples to oranges, and the resulting sum 
score will not be a reliable indicator of a single underlying 
dimension. Thus, one should treat such data as multivariate 
indicators of success in a MANOVA. 


CONTRAST ANALYSIS AND POST HOC TESTS 
Why Compare Individual Sets of Means? 

Usually, experimental hypotheses are stated in terms 
that are more specific than simply main effects or 
interactions. We may have the specific hypothesis that a 
particular textbook will improve math skills in males, but 
not in females, while another book would be about equally 
effective for both genders, but less effective overall for males. 
Now generally, we are predicting an interaction here: the 
effectiveness of the book is modified (qualified) by the 
student’s gender. However, we have a particular prediction 
concerning the nature of the interaction: we expect a 
significant difference between genders for one book, but not 
the other. This type of specific prediction is usually tested 
via contrast analysis. 

Contrast Analysis 

Briefly, contrast analysis allows us to test the statistical 
significance of predicted specific differences in particular 
parts of our complex design. It is a major and indispensable 
component of the analysis of every complex ANOVA design. 

Post hoc Comparisons 

Sometimes we find effects in our experiment that were 
not expected. Even though in most cases a creative 
experimenter will be able to explain almost any pattern of 
means, it would not be appropriate to analyze and evaluate 
that pattern as if one had predicted it all along. The problem 
here is one of capitalizing on chance when performing 
multiple tests post hoc, that is, without a priori hypotheses. 

To illustrate this point, let us consider the following 
“experiment.” Imagine we were to write down a number 
between 1 and 10 on 100 pieces of paper. We then put all of 
those pieces into a hat and draw 20 samples (of pieces of 
paper) of 5 observations each, and compute the means (from 
the numbers written on the pieces of paper) for each group. 
How likely do you think it is that we will find two sample 
means that are significantly different from each other? It is 
very likely! Selecting the extreme means obtained from 20 
samples is very different from taking only 2 samples from 
the hat in the first place, which is what the test via the 
contrast analysis implies. Without going into further detail, 
there are several so-called post hoc tests that are explicitly 
based on the first scenario (taking the extremes from 20 
samples), that is, they are based on the assumption that we 
have chosen for our comparison the most extreme (different) 
means out of k total means in the design. Those tests apply 
“corrections” that are designed to offset the advantage of 
post hoc selection of the most extreme comparisons. 
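
The hat experiment is easy to simulate. The sketch below repeats it many times and compares the gap between the two most extreme of the 20 sample means with the gap between two samples fixed in advance (here simply the first two drawn). The number of repetitions and the random seed are arbitrary choices of ours; no significance test is performed, the point is only how much larger the post hoc gap is on average.

```python
import random

def sample_means(rng):
    """Fill the hat with 100 numbers from 1 to 10, then draw 20 samples of 5."""
    papers = [rng.randint(1, 10) for _ in range(100)]
    rng.shuffle(papers)
    return [sum(papers[i:i + 5]) / 5 for i in range(0, 100, 5)]

def average_gaps(trials=2000, seed=0):
    """Average gap for post hoc extremes versus a pair chosen in advance."""
    rng = random.Random(seed)
    extreme_total = 0.0
    planned_total = 0.0
    for _ in range(trials):
        means = sample_means(rng)
        extreme_total += max(means) - min(means)    # chosen after looking
        planned_total += abs(means[0] - means[1])   # chosen before looking
    return extreme_total / trials, planned_total / trials

extreme_gap, planned_gap = average_gaps()
print(round(extreme_gap, 2), round(planned_gap, 2))
```

Because the extreme pair is selected after seeing all 20 means, its gap can never be smaller than that of the pre-selected pair, and on average it is much larger. This is the advantage that post hoc corrections are designed to offset.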

ASSUMPTIONS AND EFFECTS OF VIOLATING 
ASSUMPTIONS 

Deviation from Normal Distribution 

Assumptions. It is assumed that the dependent variable 
is measured on at least an interval scale level (see 
Elementary Concepts). Moreover, the dependent variable 
should be normally distributed within groups. 

Effects of violations: Overall, the F test is remarkably 
robust to deviations from normality (see Lindman, 1974, for 
a summary). If the kurtosis is greater than 0, then the F 
tends to be too small and we cannot reject the null hypothesis 
even though it is incorrect. The opposite is the case when 
the kurtosis is less than 0. The skewness of the distribution 
usually does not have a sizable effect on the F statistic. If 
the n per cell is fairly large, then deviations from normality 
do not matter much at all because of the central limit 
theorem, according to which the sampling distribution of 
the mean approximates the normal distribution, regardless 
of the distribution of the variable in the population. 

Homogeneity of Variances 

Assumptions: It is assumed that the variances in the 
different groups of the design are identical; this assumption 
is called the homogeneity of variances assumption. Remember 
that at the beginning of this section we computed the error 
variance (SS error) by adding up the sums of squares within 
each group. If the variances in the two groups are different 
from each other, then adding the two together is not 
appropriate, and will not yield an estimate of the common 
within-group variance (since no common variance exists). 

Effects of Violations: Lindman (1974, p. 33) shows 
that the F statistic is quite robust against violations of this 
assumption (heterogeneity of variances; see also Box, 1954a, 
1954b; Hsu, 1938). 

Special case: correlated means and variances. However, 
one instance when the F statistic is very misleading is when 
the means are correlated with variances across cells of the 
design. A scatterplot of variances or standard deviations 
against the means will detect such correlations. The reason 
why this is a “dangerous” violation is the following: Imagine 
that you have 8 cells in the design, 7 with about equal 
means but one with a much higher mean. The F statistic 
may suggest to you a statistically significant effect. However, 
suppose that there also is a much larger variance in the cell 
with the highest mean, that is, the means and the variances 
are correlated across cells (the higher the mean the larger 
the variance). In that case, the high mean in the one cell is 
actually quite unreliable, as is indicated by the large 
variance. However, because the overall F statistic is based 
on a pooled within-cell variance estimate, the high mean is 
identified as significantly different from the others, when 
in fact it is not at all significantly different if one based the 
test on the within-cell variance in that cell alone. 

This pattern (a high mean and a large variance in one 
cell) frequently occurs when there are outliers present in 
the data. One or two extreme cases in a cell with only 10 
cases can greatly bias the mean, and will dramatically 
increase the variance. 


Homogeneity of Variances and Covariances 

Assumptions. In multivariate designs, with multiple 
dependent measures, the homogeneity of variances 
assumption described earlier also applies. However, since 
there are multiple dependent variables, it is also required 
that their intercorrelations (covariances) are homogeneous 
across the cells of the design. There are various specific 
tests of this assumption. 

Effects of violations: The multivariate equivalent of 
the F test is Wilks’ lambda. Not much is known about the 
robustness of Wilks’ lambda to violations of this assumption. 
However, because the interpretation of MANOVA results 
usually rests on the interpretation of significant univariate 
effects (after the overall test is significant), the above 
discussion concerning univariate ANOVA basically applies, 
and important significant univariate effects should be 
carefully scrutinized. 

Special Case: ANCOVA. A special serious violation of 
the homogeneity of variances/covariances assumption may 
occur when covariates are involved in the design. Specifically, 
if the correlations of the covariates with the dependent 
measure(s) are very different in different cells of the design, 
gross misinterpretations of results may occur. Remember 
that in ANCOVA, we in essence perform a regression 
analysis within each cell to partition out the variance 
component due to the covariates. The homogeneity of 
variances/covariances assumption implies that we perform 
this regression analysis subject to the constraint that all 
regression equations (slopes) across the cells of the design 
are the same. If this is not the case, serious biases may 
occur. There are specific tests of this assumption, and it is 
advisable to look at those tests to ensure that the regression 
equations in different cells are approximately the same. 

Sphericity and Compound Symmetry 

Reasons for Using the Multivariate Approach to 
Repeated Measures ANOVA. In repeated measures ANOVA 
containing repeated measures factors with more than two 
levels, additional special assumptions enter the picture: the 
compound symmetry assumption and the assumption of 
sphericity. Because these assumptions rarely hold (see 
below), the MANOVA approach to repeated measures 
ANOVA has gained popularity in recent years (both tests 
are automatically computed in ANOVA/MANOVA). The 
compound symmetry assumption requires that the variances 
(pooled within-group) and covariances (across subjects) of 
the different repeated measures are homogeneous (identical). 
This is a sufficient condition for the univariate F test for 
repeated measures to be valid (i.e., for the reported F values 
to actually follow the F distribution). However, it is not a 
necessary condition. The sphericity assumption is a necessary 
and sufficient condition for the F test to be valid; it states 
that the within-subject “model” consists of independent 
(orthogonal) components. The nature of these assumptions, 
and the effects of violations are usually not well-described 
in ANOVA textbooks; in the following paragraphs we will 
try to clarify this matter and explain what it means when 
the results of the univariate approach differ from the 
multivariate approach to repeated measures ANOVA. 

The necessity of independent hypotheses: One 
general way of looking at ANOVA is to consider it a model 
fitting procedure. In a sense we bring to our data a set of a 
priori hypotheses; we then partition the variance (test main 
effects, interactions) to test those hypotheses. Computationally, 
this approach translates into generating a set of 
contrasts (comparisons between means in the design) that 
specify the main effect and interaction hypotheses. However, 
if these contrasts are not independent of each other, then 
the partitioning of variances runs afoul. For example, if two 
contrasts A and B are identical to each other and we 
partition out their components from the total variance, then 
we take the same thing out twice. Intuitively, specifying 
the two (not independent) hypotheses “the mean in Cell 1 is 
higher than the mean in Cell 2” and “the mean in Cell 1 is 
higher than the mean in Cell 2” is silly and simply makes 
no sense. Thus, hypotheses must be independent of each 
other, or orthogonal. 
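
For a design with equal cell sizes, a quick way to check whether two contrasts are orthogonal is to multiply their coefficients pairwise and sum the products: a sum of zero means the contrasts are orthogonal. The three sets of coefficients below are a hypothetical example over three cell means, chosen by us for illustration.

```python
def dot(a, b):
    """Sum of pairwise products of two coefficient lists."""
    return sum(x * y for x, y in zip(a, b))

# Contrast coefficients over three cell means (hypothetical example)
linear = [-1, 0, 1]      # Cell 3 versus Cell 1
quadratic = [1, -2, 1]   # Cell 2 versus the average of Cells 1 and 3
duplicate = [-1, 0, 1]   # the same hypothesis as `linear`, stated twice

print(dot(linear, quadratic))  # 0: orthogonal, the two carve out different variance
print(dot(linear, duplicate))  # 2: not orthogonal, the same variance is counted twice
```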

Independent hypotheses in repeated measures: The 
general algorithm implemented will attempt to generate, 
for each effect, a set of independent (orthogonal) contrasts. 
In repeated measures ANOVA, these contrasts specify a set 
of hypotheses about differences between the levels of the 
repeated measures factor. However, if these differences are 
correlated across subjects, then the resulting contrasts are 
no longer independent. For example, in a study where we 
measured learning at three times during the experimental 
session, it may happen that the changes from time 1 to 
time 2 are negatively correlated with the changes from time 
2 to time 3: subjects who learn most of the material between 
time 1 and time 2 improve less from time 2 to time 3. In 
fact, in most instances where a repeated measure ANOVA 
is used, one would probably suspect that the changes across 
levels are correlated across subjects. However, when this 
happens, the compound symmetry and sphericity 
assumptions have been violated, and independent contrasts 
cannot be computed. 

Effects of violations and remedies: When the 
compound symmetry or sphericity assumptions have been 
violated, the univariate ANOVA table will give erroneous 
results. Before multivariate procedures were well 
understood, various approximations were introduced to 
compensate for the violations. 

MANOVA approach to repeated measures: To 
summarize, the problem of compound symmetry and 
sphericity pertains to the fact that multiple contrasts 
involved in testing repeated measures effects (with more 
than two levels) are not independent of each other. However, 
they do not need to be independent of each other if we use 
multivariate criteria to simultaneously test the statistical 
significance of the two or more repeated measures contrasts. 
This “insight” is the reason why MANOVA methods are 
increasingly applied to test the significance of univariate 
repeated measures factors with more than two levels. We 
wholeheartedly endorse this approach because it simply 
bypasses the assumption of compound symmetry and 
sphericity altogether. 


Cases when the MANOVA approach cannot be 
used: There are instances (designs) when the MANOVA 
approach cannot be applied; specifically, when there are 
few subjects in the design and many levels on the repeated 
measures factor, there may not be enough degrees of freedom 
to perform the multivariate analysis. For example, if we
have 12 subjects and p = 4 repeated measures factors, each
at k = 3 levels, then the four-way interaction would
“consume” (k-1)**p = 2**4 = 16 degrees of freedom. However, we
have only 12 subjects, so in this instance the multivariate
test cannot be performed.
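
The degrees-of-freedom arithmetic in this example can be sketched in a few lines of Python. The function names are hypothetical, and the feasibility rule (more subjects than degrees of freedom consumed by the interaction) is a simplifying assumption taken from the example above, not a complete check for a multivariate analysis.

```python
def interaction_df(levels_per_factor):
    """Degrees of freedom of the highest-order interaction among
    repeated measures factors: the product of (k - 1) over all factors."""
    df = 1
    for k in levels_per_factor:
        df *= k - 1
    return df


def multivariate_test_feasible(n_subjects, levels_per_factor):
    # Assumption for illustration: the test needs more subjects than
    # the interaction consumes in degrees of freedom.
    return n_subjects > interaction_df(levels_per_factor)


# Example from the text: p = 4 factors, each at k = 3 levels, 12 subjects.
print(interaction_df([3, 3, 3, 3]))                  # 16
print(multivariate_test_feasible(12, [3, 3, 3, 3]))  # False
```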

Differences in univariate and multivariate results: 
Anyone whose research involves extensive repeated 
measures designs has seen cases when the univariate 
approach to repeated measures ANOVA gives clearly 
different results from the multivariate approach. To repeat 
the point, this means that the differences between the levels 
of the respective repeated measures factors are in some 
way correlated across subjects. Sometimes, this insight by 
itself is of considerable interest. 


Methods for Analysis of Variance 

Several chapters in this textbook discuss methods for 
performing analysis of variance. Although many of the
available statistics overlap in the different chapters, each is
best suited for particular applications.

General Linear Models: This extremely comprehensive
chapter discusses a complete implementation of the general
linear model, and describes the sigma-restricted as well as
the overparameterized approach. This chapter includes
information on incomplete designs, complex analysis of
covariance designs, nested designs (balanced or unbalanced),
mixed model ANOVA designs (with random effects), and
huge balanced ANOVA designs (efficiently). It also contains
descriptions of six types of Sums of Squares.


General Stepwise Regression: This chapter discusses 
the between subject designs and multivariate designs which 
are appropriate for stepwise regression as well as discussing 
how to perform stepwise and best-subset model building 
(for continuous as well as categorical predictors).

Mixed ANCOVA and Variance Components: This 
chapter includes discussions of experiments with random 
effects (mixed model ANOVA), estimating variance 
components for random effects, or large main effect designs 
(e.g., with factors with over 100 levels) with or without 
random effects, or large designs with many factors, when 
you do not need to estimate all interactions.


Experimental Design (DOE): This chapter includes 
discussions of standard experimental designs for industrial/ 
manufacturing applications, including 2**(k-p) and 3**(k-p)
designs, central composite and non-factorial designs,
designs for mixtures, D and A optimal designs, and designs
for arbitrarily constrained experimental regions.


Repeatability and Reproducibility Analysis (in the 
Process Analysis chapter): This section in the Process
Analysis chapter includes a discussion of specialized designs
for evaluating the reliability and precision of measurement 
systems; these designs usually include two or three random 
factors, and specialized statistics can be computed for 
evaluating the quality of a measurement system (typically 
in industrial/manufacturing applications). 

Breakdown Tables (in the Basic Statistics chapter):
This chapter includes discussions of experiments with only 
one factor (and many levels), or with multiple factors, when 
a complete ANOVA table is not required. 



5 


Process Analysis 


Sampling Plans 

General Purpose: A common question that quality control 
engineers face is to determine how many items from a batch 
(e.g., shipment from a supplier) to inspect in order to ensure 
that the items (products) in that batch are of acceptable 
quality. For example, suppose we have a supplier of piston 
rings for small automotive engines that our company 
produces, and our goal is to establish a sampling procedure 
(of piston rings from the delivered batches) that ensures a 
specified quality. In principle, this problem is similar to 
that of on-line quality control discussed in Quality Control. 
In fact, you may want to read that section at this point to 
familiarize yourself with the issues involved in industrial 
statistical quality control. 

Acceptance sampling: The procedures described here 
are useful whenever we need to decide whether or not a 
batch or lot of items complies with specifications, without 
having to inspect 100% of the items in the batch. Because 
of the nature of the problem—whether or not to accept a 
batch—these methods are also sometimes discussed under 
the heading of acceptance sampling.

Advantages over 100% inspection: An obvious
advantage of acceptance sampling over 100% inspection of 
the batch or lot is that reviewing only a sample requires 
less time, effort, and money. In some cases, inspection of an 
item is destructive (e.g., stress testing of steel), and testing 
100% would destroy the entire batch. Finally, from a 
managerial standpoint, rejecting an entire batch or shipment 
(based on acceptance sampling) from a supplier, rather than 
just a certain percent of defective items (based on 100% 
inspection) often provides a stronger incentive to the supplier 
to adhere to quality standards. 


Computational Approach 


In principle, the computational approach to the question
of how large a sample to take is straightforward. Elementary
Concepts discusses the concept of the sampling distribution.
Briefly, if we were to take repeated samples of a particular
size from a population of, for example, piston rings and
compute their average diameters, then the distribution of
those averages (means) would approach the normal
distribution with a particular mean and standard deviation
(or standard error: in sampling distributions the term
standard error is preferred, in order to distinguish the
variability of the means from the variability of the items in
the population). Fortunately, we do not need to take repeated
samples from the population in order to estimate the location
(mean) and variability (standard error) of the sampling
distribution. If we have a good idea (estimate) of what the
variability (standard deviation or sigma) is in the population,
then we can infer the sampling distribution of the mean. In
principle, this information is sufficient to estimate the
sample size that is needed in order to detect a certain change
in quality (from target specifications). Without going into
the details about the computational procedures involved,
let us next review the particular information that the
engineer must supply in order to estimate required sample
sizes.


Means for H0 and H1

To formalize the inspection process of, for example, a
shipment of piston rings, we can formulate two alternative
hypotheses: First, we may hypothesize that the average
piston ring diameters comply with specifications. This
hypothesis is called the null hypothesis (H0). The second
and alternative hypothesis (H1) is that the diameters of the
piston rings delivered to us deviate from specifications by
more than a certain amount. Note that we may specify
these types of hypotheses not just for measurable variables
such as diameters of piston rings, but also for attributes.
For example, we may hypothesize (H1) that the number of
defective parts in the batch exceeds a certain percentage.
Intuitively, it should be clear that the larger the difference
between H0 and H1, the smaller the sample necessary to
detect this difference.

Alpha and Beta Error Probabilities 

To return to the piston rings example, there are two
types of mistakes that we can make when inspecting a batch
of piston rings that has just arrived at our plant. First, we
may erroneously reject H0, that is, reject the batch because
we erroneously conclude that the piston ring diameters
deviate from target specifications. The probability of
committing this mistake is usually called the alpha error
probability. The second mistake that we can make is to
erroneously not reject H0 (accept the shipment of piston
rings), when, in fact, the mean piston ring diameter deviates
from the target specification by a certain amount. The
probability of committing this mistake is usually called the
beta error probability. Intuitively, the more certain we want
to be, that is, the lower we set the alpha and beta error
probabilities, the larger the sample will have to be; in fact,
in order to be 100% certain, we would have to measure
every single piston ring delivered to our company.
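
To make the relationship between the error probabilities and the sample size concrete, the following Python sketch uses the usual normal-approximation formula for a two-sided test on a mean with known sigma. The function name and the numbers in the example (a sigma of 0.01 mm and a shift of 0.005 mm) are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, shift, alpha=0.05, beta=0.10):
    """Approximate sample size needed to detect a mean shift of size
    `shift` with a two-sided test at level alpha and power 1 - beta,
    assuming a known population standard deviation sigma."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / shift) ** 2)


# Hypothetical piston ring numbers: sigma = 0.01 mm, shift = 0.005 mm.
print(required_sample_size(0.01, 0.005))              # 43
# Lower error probabilities require a larger sample.
print(required_sample_size(0.01, 0.005, 0.01, 0.01))
```

As the text suggests, a larger shift between H0 and H1 lowers the required sample size, while smaller alpha and beta values raise it.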



Statistical Methods in Applied Biology 


Fixed Sampling Plans 

To construct a simple sampling plan, we would first 
decide on a sample size, based on the means under H0 and H1
and the particular alpha and beta error probabilities. Then, 
we would take a single sample of this fixed size and, based 
on the mean in this sample, decide whether to accept or 
reject the batch. This procedure is referred to as a fixed 
sampling plan. 


Operating characteristic (OC) curve: The power of 
the fixed sampling plan can be summarized via the operating 
characteristic curve. In that plot, the probability of rejecting
H0 (and accepting H1) is plotted on the Y-axis, as a function
of an actual shift from the target (nominal) specification to
the respective values shown on the X-axis of the plot (see
example below). This probability is, of course, one minus the
beta error probability of erroneously rejecting H1 and
accepting H0; this value is referred to as the power of the
fixed sampling plan to detect deviations. Also indicated in
this plot are the power functions for smaller sample sizes.
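
The power values plotted in such a curve can be sketched as follows. This is a minimal illustration that assumes a two-sided z-test on the sample mean with known sigma; the function name and the example values are hypothetical.

```python
from statistics import NormalDist


def power_of_fixed_plan(shift, sigma, n, alpha=0.05):
    """Probability of rejecting H0 (two-sided z-test on the sample mean)
    when the true mean is `shift` away from the target specification."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    d = shift / (sigma / n ** 0.5)   # shift expressed in standard errors
    return z.cdf(-z_crit + d) + z.cdf(-z_crit - d)


# Power as a function of the shift, for two hypothetical sample sizes.
for n in (10, 43):
    row = [round(power_of_fixed_plan(s, 0.01, n), 3)
           for s in (0.0, 0.0025, 0.005, 0.0075)]
    print(n, row)
```

At a shift of zero the function returns alpha, and for any nonzero shift the smaller sample size gives the lower power, which is the pattern the plot described above is meant to show.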

Sequential Sampling Plans

As an alternative to the fixed sampling plan, we could
randomly choose individual piston rings and record their
deviations from specification. As we continue to measure
each piston ring, we could keep a running total of the sum
of deviations from specification. Intuitively, if H1 is true,
that is, if the average piston ring diameter in the batch is
not on target, then we would expect to observe a slowly
increasing or decreasing cumulative sum of deviations,
depending on whether the average diameter in the batch is
larger or smaller than the specification, respectively. It turns
out that this kind of sequential sampling of individual items
from the batch is a more sensitive procedure than taking a
fixed sample. In practice, we continue sampling until we
either accept or reject the batch.


Using a sequential sampling plan: Typically, we 
would produce a graph in which the cumulative deviations 
from specification (plotted on the Y-axis) are shown for
successively sampled items (e.g., piston rings, plotted on 
the X-axis). Then two sets of lines are drawn in this graph 
to denote the “corridor” along which we will continue to 
draw samples, that is, as long as the cumulative sum of 
deviations from specifications stays within this corridor, we 
continue sampling. 



If the cumulative sum of deviations steps outside the
corridor we stop sampling. If the cumulative sum moves
above the upper line or below the lowest line, we reject the
batch. If the cumulative sum steps out of the corridor to the
inside, that is, if it moves closer to the centerline, we accept
the batch (since this indicates zero deviation from
specification). Note that the inside area starts only at a
certain sample number; this indicates the minimum number
of samples necessary to accept the batch (with the current
error probability).
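
The text does not give the equations of the corridor lines. As one possible illustration, the sketch below uses Wald's sequential probability ratio test for a normal mean with known sigma, in a one-sided simplification (H0: mean deviation of zero, H1: mean deviation of delta > 0). The function name and all numeric values are hypothetical; the two-sided corridor described above would add a mirrored pair of lines for negative deviations.

```python
from math import log


def sequential_decision(deviations, delta, sigma, alpha=0.05, beta=0.05):
    """Wald sequential test of H0: mean deviation = 0 against
    H1: mean deviation = delta (delta > 0), with known sigma.
    Returns (decision, number_of_items_inspected)."""
    # Intercepts of the rejection and acceptance lines.
    upper = (sigma ** 2 / delta) * log((1 - beta) / alpha)
    lower = (sigma ** 2 / delta) * log(beta / (1 - alpha))
    slope = delta / 2                # common slope of both lines
    cumulative = 0.0
    n = 0
    for n, dev in enumerate(deviations, start=1):
        cumulative += dev            # running sum of deviations
        if cumulative >= upper + slope * n:
            return "reject", n
        if cumulative <= lower + slope * n:
            return "accept", n
    return "continue", n             # still inside the corridor


# Hypothetical data: rings consistently 0.02 mm above target.
print(sequential_decision([0.02] * 5, delta=0.02, sigma=0.01))
# Rings exactly on target.
print(sequential_decision([0.0] * 5, delta=0.02, sigma=0.01))
```

In this sketch the acceptance line starts below zero, so a batch cannot be accepted after the very first item even when it is exactly on target, which mirrors the minimum sample number mentioned in the text.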


PROCESS (MACHINE) CAPABILITY ANALYSIS 


Introductory Overview 

Quality Control describes numerous methods for 
monitoring the quality of a production process. However, 
once a process is under control the question arises, “to what 
extent does the long-term performance of the process comply 
with engineering requirements or managerial goals?” For 
example, to return to our piston ring example, how many of 
the piston rings that we are using fall within the design 
specification limits? In more general terms, the question is, 
“how capable is our process (or supplier) in terms of 
producing items within the specification limits?” Most of 
the procedures and indices described here were only recently 
introduced to the US by Ford Motor Company (Kane, 1986). 
They allow us to summarize the process capability in terms 
of meaningful percentages and indices. 


In this topic, the computation and interpretation of 
process capability indices will first be discussed for the 
normal distribution case. If the distribution of the quality 
characteristic of interest does not follow the normal 
distribution, modified capability indices can be computed
based on the percentiles of a fitted non-normal distribution. 


Order of business: Note that it makes little sense to
examine the process capability if the process is not in control.
If the means of successively taken samples fluctuate widely,
or are clearly off the target specification, then those quality
problems should be addressed first. Therefore, the first step
towards a high-quality process is to bring the process under
control, using the techniques available in Quality Control.


Computational Approach 

Once a process is in control, we can ask the question 
concerning the process capability. Again, the approach to 
answering this question is based on “statistical” reasoning 
and is actually quite similar to that presented earlier in the 
context of sampling plans. To return to the piston ring 
example, given a sample of a particular size, we can estimate 
the standard deviation of the process, that is, the resultant 
ring diameters. We can then draw a histogram of the 
distribution of the piston ring diameters. As we discussed 
earlier, if the distribution of the diameters is normal, then 
we can make inferences concerning the proportion of piston 
rings within specification limits. 



Let us now review some of the major indices that are 
commonly used to describe process capability. 

Capability Analysis—Process Capability Indices 

Process range: First, it is customary to establish the
± 3 sigma limits around the nominal specifications. Actually,
the sigma limits should be the same as the ones used to
bring the process under control using Shewhart control
charts. These limits denote the range of the process (i.e.,
process range). If we use the ± 3 sigma limits then, based
on the normal distribution, we can estimate that
approximately 99.7% of all piston rings fall within these limits.

Specification limits LSL, USL: Usually, engineering 
requirements dictate a range of acceptable values. In our 
example, it may have been determined that acceptable values 
for the piston ring diameters would be 74.0 ± .02 millimeters. 
Thus, the lower specification limit (LSL) for our process is 
74.0 - 0.02 = 73.98; the upper specification limit (USL) is 
74.0 + 0.02 = 74.02. The difference between USL and LSL 
is called the specification range. 

Potential capability (Cp): This is the simplest and
most straightforward indicator of process capability. It is
defined as the ratio of the specification range to the process
range; using ± 3 sigma limits we can express this index as:

Cp = (USL-LSL)/(6*Sigma)

Put into words, this ratio expresses the proportion of 
the range of the normal curve that falls within the 
engineering specification limits (provided that the mean is 
on target, that is, that the process is centered). 
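
As a small worked illustration, the following Python sketch computes Cp for the piston ring specification limits and the fraction of items expected outside those limits. It assumes a centered, normally distributed process; the function names and the sigma value of 0.005 mm are hypothetical.

```python
from statistics import NormalDist


def cp_index(usl, lsl, sigma):
    """Potential capability: specification range over the
    6-sigma process range."""
    return (usl - lsl) / (6 * sigma)


def fraction_outside_spec(cp):
    """Expected fraction of items outside the specification limits for
    a centered, normally distributed process with the given Cp."""
    return 2 * (1 - NormalDist().cdf(3 * cp))


# Specification limits from the text; sigma = 0.005 mm is hypothetical.
cp = cp_index(74.02, 73.98, 0.005)
print(round(cp, 2))                  # about 1.33
print(fraction_outside_spec(1.0))    # about 0.0027
```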


Bhote (1988) reports that prior to the widespread use of
statistical quality control techniques (prior to 1980), the
normal quality of US manufacturing processes was
approximately Cp = .67. This means that the specification
range covers only about two-thirds of the ± 3 sigma process
range, so that both tails of the normal curve fall outside the
specification limits (roughly 4.5% of all items for a centered
process). As of 1988, only about 30% of US processes are at
or below this level of quality.

Ideally, of course, we would like this index to be greater
than 1, that is, we would like to achieve a process capability
so that no (or almost no) items fall outside specification
limits. Interestingly, in the early 1980’s the Japanese
manufacturing industry adopted as their standard Cp = 1.33!
The process capability required to manufacture high-tech
products is usually even higher than this; Minolta has
established a Cp index of 2.0 as their minimum standard
(Bhote, 1988, p. 53), and as the standard for its suppliers.
Note that high process capability usually implies lower, not
higher costs, taking into account the costs due to poor
quality. We will return to this point shortly.

Capability ratio (Cr): This index is equivalent to Cp;
specifically, it is computed as 1/Cp (the inverse of Cp).

Lower/upper potential capability: Cpl, Cpu. A major
shortcoming of the Cp (and Cr) index is that it may yield
erroneous information if the process is not on target, that
is, if it is not centered. We can express non-centering via
the following quantities. First, upper and lower potential
capability indices can be computed to reflect the deviation
of the observed process mean from the LSL and USL.
Assuming ± 3 sigma limits as the process range, we compute:

Cpl = (Mean - LSL)/(3*Sigma)

and

Cpu = (USL - Mean)/(3*Sigma)

Obviously, if these values are not identical to each other, 
then the process is not centered. 

Non-centering correction (K): We can correct Cp for
the effects of non-centering. Specifically, we can compute:

K=abs(D - Mean)/(1/2*(USL - LSL)) 
where 

D = (USL+LSL)/2. 

This correction factor expresses the non-centering (target 
specification minus mean) relative to the specification range. 

Demonstrated excellence (Cpk): Finally, we can adjust
Cp for the effect of non-centering by computing:

Cpk = (1-K)*Cp

If the process is perfectly centered, then K is equal to
zero, and Cpk is equal to Cp. However, as the process drifts
from the target specification, K increases and Cpk becomes
smaller than Cp.
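
The indices defined in this section can be collected into one short Python sketch. The function name and the process mean and sigma in the example are hypothetical; the specification limits are those of the piston ring example.

```python
def capability_indices(mean, sigma, lsl, usl):
    """Cp, Cpl, Cpu, K and Cpk as defined in the text,
    using +/- 3 sigma limits as the process range."""
    cp = (usl - lsl) / (6 * sigma)
    cpl = (mean - lsl) / (3 * sigma)
    cpu = (usl - mean) / (3 * sigma)
    d = (usl + lsl) / 2                    # design center
    k = abs(d - mean) / ((usl - lsl) / 2)  # non-centering correction
    cpk = (1 - k) * cp
    return {"Cp": cp, "Cpl": cpl, "Cpu": cpu, "K": k, "Cpk": cpk}


# Hypothetical process running 0.005 mm above the 74.0 mm target.
for name, value in capability_indices(74.005, 0.005, 73.98, 74.02).items():
    print(name, round(value, 3))
```

When the process mean lies between the specification limits, Cpk computed this way equals the smaller of Cpl and Cpu, which is a convenient cross-check on the calculation.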

Potential Capability II: Cpm. A recent modification
(Chan, Cheng, & Spiring, 1988) to Cp is directed at adjusting
the estimate of sigma for the effect of (random) non-centering.
Specifically, we may compute the alternative
sigma (Sigma2) as:

Sigma2 = {Sum[(xi - TS)**2]/(n-1)}**(1/2)

Where:

Sigma2 is the alternative estimate of sigma
xi is the value of the i’th observation in the sample
TS is the target or nominal specification
n is the number of observations in the sample

We may then use this alternative estimate of sigma to
compute Cp as before; however, we will refer to the resultant
index as Cpm.
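
A minimal Python sketch of this computation follows. The function name and the five sample diameters are hypothetical; the target and specification limits come from the piston ring example.

```python
from math import sqrt


def cpm_index(values, target, lsl, usl):
    """Cp computed with sigma estimated around the target
    specification (TS) instead of around the sample mean."""
    n = len(values)
    sigma2 = sqrt(sum((x - target) ** 2 for x in values) / (n - 1))
    return (usl - lsl) / (6 * sigma2)


# Hypothetical sample of five ring diameters (mm).
sample = [73.99, 74.00, 74.01, 74.00, 74.01]
print(round(cpm_index(sample, 74.0, 73.98, 74.02), 3))

# The same sample shifted 0.01 mm off target gives a lower Cpm.
shifted = [x + 0.01 for x in sample]
print(round(cpm_index(shifted, 74.0, 73.98, 74.02), 3))
```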

Process Performance vs. Process Capability 

When monitoring a process via a quality control chart, 
it is often useful to compute the capability indices for the 
process. Specifically, when the data set consists of multiple 
samples, such as data collected for the quality control chart, 
then one can compute two different indices of variability in
the data. One is the regular standard deviation for all
observations, ignoring the fact that the data consist of
multiple samples; the other is to estimate the process's
inherent variation from the within-sample variability. For
example, when plotting X-bar and R-charts one may use
the common estimator R-bar/d2 for the process sigma. Note
however, that this estimator is only valid if the process is
statistically stable. For a detailed discussion of the difference
between the total process variation and the inherent
variation, refer to the ASQC/AIAG reference manual
(ASQC/AIAG, 1991).

When the total process variability is used in the standard
capability computations, the resulting indices are usually
referred to as process performance indices (as they describe
the actual performance of the process), while indices
computed from the inherent variation (within-sample sigma)
are referred to as capability indices (since they describe the
inherent capability of the process).
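
The two estimates of sigma can be compared with a short Python sketch. The function names and the three samples are hypothetical; the d2 values are the standard control-chart constants for subgroup sizes 2 to 5.

```python
from statistics import mean, stdev

# Standard d2 constants for subgroup sizes 2 to 5.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}


def overall_sigma(samples):
    """Standard deviation of all observations, ignoring the fact
    that the data consist of multiple samples."""
    return stdev([x for sample in samples for x in sample])


def within_sigma(samples):
    """R-bar/d2 estimate of the inherent (within-sample) sigma.
    Assumes equal-sized samples of 2 to 5 observations."""
    size = len(samples[0])
    r_bar = mean(max(s) - min(s) for s in samples)
    return r_bar / D2[size]


# Hypothetical samples whose means drift from one sample to the next.
samples = [[74.00, 74.01, 73.99],
           [74.02, 74.03, 74.01],
           [73.98, 73.99, 73.97]]
print(round(overall_sigma(samples), 4))  # total variation (performance)
print(round(within_sigma(samples), 4))   # inherent variation (capability)
```

Because the sample means drift in this example, the overall sigma is larger than the within-sample sigma, so performance indices based on it would come out lower than the corresponding capability indices.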

Using Experiments to Improve Process Capability 

As mentioned before, the higher the Cp index, the better
the process, and there is virtually no upper limit to this
relationship. The issue of quality costs, that is, the losses
due to poor quality, is discussed in detail in the context of
Taguchi robust design methods. In general, higher quality
usually results in lower costs overall; even though the costs
of production may increase, the losses due to poor quality,
for example, due to customer complaints, loss of market
share, etc., are usually much greater. In practice, two or
three well-designed experiments carried out over a few weeks
can often achieve a Cp of 5 or higher. If you are not familiar
with the use of designed experiments, but are concerned
with the quality of a process, see Experimental Design.


Testing the Normality Assumption 

The indices we have just reviewed are only meaningful 
if, in fact, the quality characteristic that is being measured 
is normally distributed. Specific tests of the normality
assumption are available; these tests are described in most
statistics textbooks.


Tolerance Limits 

Before the introduction of process capability indices in
the early 1980’s, the common method for estimating the
characteristics of a production process was to estimate and
examine the tolerance limits of the process (see, for example,
Hald, 1952). The logic of this procedure is as follows. Let us
assume that the respective quality characteristic is normally
distributed in the population of items produced; we can
then estimate the lower and upper interval limits that will
ensure with a certain level of confidence (probability) that a
certain percent of the population is included in those limits.
Put another way, given:

a specific sample size (n),
the process mean,
the process standard deviation (sigma),
a confidence level, and
the percent of the population that we want to be included in
the interval,

we can compute the corresponding tolerance limits that will
satisfy all these parameters. You can also compute
parameter-free tolerance limits that are not based on the
assumption of normality.
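
For the parameter-free case, one common choice (an assumption here, since the text does not name a specific method) is to use the smallest and largest observations in the sample as tolerance limits. The sketch below computes the confidence attached to such limits for a continuous distribution; the function names are hypothetical.

```python
def coverage_confidence(n, proportion):
    """Confidence that the interval from the smallest to the largest
    of n observations contains at least `proportion` of the
    population (any continuous distribution)."""
    p = proportion
    return 1 - n * p ** (n - 1) + (n - 1) * p ** n


def min_sample_size(proportion, confidence):
    """Smallest n for which the sample range serves as a tolerance
    interval with the requested coverage and confidence."""
    n = 2
    while coverage_confidence(n, proportion) < confidence:
        n += 1
    return n


# Sample size so that (min, max) covers 95% of items with 95% confidence.
print(min_sample_size(0.95, 0.95))   # 93
```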


GAGE REPEATABILITY AND REPRODUCIBILITY

Introductory Overview

Gage repeatability and reproducibility analysis addresses
the issue of precision of measurement. The purpose of
repeatability and reproducibility experiments is to determine
the proportion of measurement variability that is due to (1)
the items or parts being measured (part-to-part variation),
(2) the operator or appraiser of the gages (reproducibility),
and (3) errors (unreliabilities) in the measurements over
several trials by the same operators of the same parts
(repeatability). In the ideal case, all variability in
measurements will be due to the part-to-part variation, and only a
negligible proportion of the variability will be due to operator 
reproducibility and trial-to-trial repeatability. 

To return to the piston ring example, if we require 
detection of deviations from target specifications of the 
magnitude of .01 millimeters, then we obviously need to
use gages of sufficient precision. The procedures described
here allow the engineer to evaluate the precision of gages 
and different operators (users) of those gages, relative to 
the variability of the items in the population. 

You can compute the standard indices of repeatability, 
reproducibility, and part-to-part variation, based either on 
ranges (as is still common in these types of experiments) or
from the analysis of variance (ANOVA) table (as, for 
example, recommended in ASQC/AIAG, 1990, page 65). The 
ANOVA table will also contain an F test (statistical 
significance test) for the operator-by-part interaction, and 
report the estimated variances, standard deviations, and 
confidence intervals for the components of the ANOVA 
model. 

Finally, you can compute the respective percentages of 
total variation, and report so-called percent-of-tolerance 
statistics. These measures are briefly discussed in the 
following sections of this introduction. 

Computational Approach 

One may think of each measurement as consisting of 
the following components: a component due to the 
characteristics of the part or item being measured, a 
component due to the reliability of the gage, and a component 
due to the characteristics of the operator (user) of the gage. 

The method of measurement (measurement system) is 
reproducible if different users of the gage come up with 
identical or very similar measurements. A measurement 
method is repeatable if repeated measurements of the same
part produce identical results. Both of these characteristics
(repeatability and reproducibility) will affect the precision
of the measurement system.

We can design an experiment to estimate the magnitudes 
of each component, that is, the repeatability, reproducibility, 
and the variability between parts, and thus assess the
precision of the measurement system. In essence, this
procedure amounts to an analysis of variance on an 
experimental design, which includes as factors different 
parts, operators, and repeated measurements (trials). We 
can then estimate the corresponding variance components 
(the term was first used by Daniels, 1939) to assess the 
repeatability (variance due to differences across trials), 
reproducibility (variance due to differences across operators), 
and variability between parts (variance due to differences 
across parts). 

Plots of Repeatability and Reproducibility 

There are several ways to summarize via graphs the 
findings from a repeatability and reproducibility experiment. 
For example, suppose we are manufacturing small kilns 
that are used for drying materials for other industrial 
production processes. The kilns should operate at a target 
temperature of around 100 degrees Celsius. In this study, 5 
different engineers (operators) measured the same sample 
of 8 kilns (parts), three times each (three trials). We can 
plot the mean ratings of the 8 parts by operator. If the 
measurement system is reproducible, then the pattern of 
means across parts should be quite consistent across the 5
engineers who participated in the study. 

R and S charts: Quality Control discusses in detail the
idea of R (range) and S (sigma) plots for controlling process
variability. We can apply those ideas here and produce a
plot of ranges (or sigmas) by operators or by parts; these
plots will allow us to identify outliers among operators or
parts. If one operator produced particularly wide ranges of
measurements, we may want to find out why that particular
person had problems producing reliable measurements (e.g.,
perhaps he or she failed to understand the instructions for
using the measurement gage).

Analogously, producing an R chart by parts may allow
us to identify parts that are particularly difficult to measure
reliably; again, inspecting that particular part may give us
some insights into the weaknesses in our measurement
system.

Repeatability and reproducibility summary plot: The
summary plot shows the individual measurements by each
operator; specifically, the measurements are shown in terms
of deviations from the respective average rating for the
respective part. A point represents each trial, and a vertical
line connects the different measurement trials for each
operator for each part. Boxes drawn around the
measurements give us a general idea of a particular
operator’s bias (see graph below).



COMPONENTS OF VARIANCE 

Percent of Process Variation and Tolerance: The
Percent Tolerance allows you to evaluate the performance of
the measurement system with regard to the overall process
variation, and the respective tolerance range. You can specify
the tolerance range (Total tolerance for parts) and the
Number of sigma intervals. The latter value is used in the
computations to define the range (spread) of the respective
(repeatability, reproducibility, part-to-part, etc.) variability.
Specifically, the default value (5.15) defines 5.15 times the
respective sigma estimate as the respective range of values;
if the data are normally distributed, then this range defines
99% of the space under the normal curve, that is, the range
that will include 99% of all values (or reproducibility/
repeatability errors) due to the respective source of variation.
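
The origin of the 5.15 multiplier and its use in a percent-of-tolerance figure can be sketched in Python as follows. The function names are hypothetical, and the percent-of-tolerance formula (spread of one source divided by the total tolerance) is a simplified reading of the description above; the sigma and tolerance values are made up for illustration.

```python
from statistics import NormalDist


def sigma_multiplier(coverage=0.99):
    """Width, in sigma units, of the central interval of a normal
    distribution that contains the given fraction of values."""
    return 2 * NormalDist().inv_cdf(0.5 + coverage / 2)


def percent_of_tolerance(sigma, total_tolerance, n_sigma=5.15):
    """Spread of one source of variation, expressed as a percentage
    of the total tolerance range."""
    return 100 * n_sigma * sigma / total_tolerance


print(round(sigma_multiplier(0.99), 2))     # 5.15
# Hypothetical repeatability sigma of 0.002 mm, tolerance of 0.04 mm.
print(percent_of_tolerance(0.002, 0.04))
```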

Percent of process variation: This value reports the 
variability due to different sources relative to the total 
variability (range) in the measurements.

Analysis of Variance: Rather than computing variance
components estimates based on ranges, a more accurate method
for computing these estimates is based on the ANOVA mean
squares. Customarily, it is assumed that all interaction
effects by the trial factor are non-significant. This
assumption seems reasonable, since, for example, it is
difficult to imagine how the measurement of some parts
will be systematically different in successive trials, in
particular when parts and trials are randomized.


However, the Operator by Parts interaction may be
important. For example, it is conceivable that certain less
experienced operators will be more prone to particular biases,
and hence will arrive at systematically different
measurements for particular parts. If so, then one would
expect a significant two-way interaction (again, refer to
ANOVA/MANOVA if you are not familiar with
ANOVA terminology).

In the case when the two-way interaction is statistically
significant, one can separately estimate the variance
components due to operator variability, and due to the
operator by parts variability.

In the case of significant interactions, the combined
repeatability and reproducibility variability is defined as
the sum of three components: repeatability (gage error),
operator variability, and the operator-by-part variability. If
the Operator by Part interaction is not statistically
significant, a simpler additive model can be used without
interactions.
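
As an illustration of how the components are obtained from the ANOVA mean squares, the sketch below uses the standard expected-mean-square relations for a crossed parts-by-operators design with repeated trials and random effects. The function name, the rule of setting negative estimates to zero, and the mean square values in the example are assumptions made for this sketch; only the design size (8 parts, 5 operators, 3 trials) comes from the kiln example.

```python
def gage_rr_components(ms_operator, ms_part, ms_interaction, ms_error,
                       n_parts, n_operators, n_trials):
    """Variance components for a crossed parts x operators design with
    repeated trials, estimated from the ANOVA mean squares."""
    def nonneg(x):
        return max(x, 0.0)   # negative estimates are set to zero

    repeatability = ms_error
    interaction = nonneg((ms_interaction - ms_error) / n_trials)
    operator = nonneg((ms_operator - ms_interaction)
                      / (n_parts * n_trials))
    part = nonneg((ms_part - ms_interaction)
                  / (n_operators * n_trials))
    return {
        "repeatability": repeatability,
        "operator": operator,
        "operator_by_part": interaction,
        "part_to_part": part,
        # Combined R&R: the sum of the three components named above.
        "combined_rr": repeatability + operator + interaction,
    }


# Hypothetical mean squares for 8 parts, 5 operators, 3 trials.
result = gage_rr_components(10.0, 50.0, 2.5, 1.0,
                            n_parts=8, n_operators=5, n_trials=3)
for name, value in result.items():
    print(name, round(value, 4))
```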


NON-NORMAL DISTRIBUTIONS 

Introductory Overview 


General Purpose: The concept of process capability is
described in detail in the Process Capability Overview. To
reiterate, when judging the quality of a (e.g., production)
process it is useful to estimate the proportion of items
produced that fall outside a predefined acceptable
specification range. For example, the so-called Cp index is
computed as:

Cp = (USL-LSL)/(6*Sigma)

where sigma is the estimated process standard deviation 
and USL and LSL are the upper and lower specification 
limits, respectively. If the distribution of the respective 
quality characteristic or variable (e.g., size of piston rings) 
is normal, and the process is perfectly centered (i.e., the 
mean is equal to the design center), then this index can be 
interpreted as the proportion of the range of the standard 
normal curve (the process width) that falls within the 
engineering specification limits. If the process is not centered, 
an adjusted index C pk is used instead. 

Non-Normal Distributions: You can fit non-normal 
distributions to the observed histogram, and compute 
capability indices based on the respective fitted non-normal 
distribution (via the percentile method). In addition, instead 
of computing capability indices by fitting specific 
distributions, you can compute capability indices based on 
two different general families of distributions, the Johnson
distributions (Johnson, 1965; see also Hahn and Shapiro,
1967) and Pearson distributions (Johnson, Nixon, Amos,
and Pearson, 1963; Gruska, Mirkhani, and Lamberson, 1989;
Pearson and Hartley, 1972), which allow the user to
approximate a wide variety of continuous distributions. For
all distributions, the user can also compute the table of
expected frequencies, the expected number of observations
beyond specifications, and quantile-quantile and
probability-probability plots. The specific method for
computing process capability indices from these distributions
is described in Clements (1989).

Quantile-quantile plots and probability-probability plots.
There are various methods for assessing the quality of
respective fit to the observed data. In addition to the table
of observed and expected frequencies for different intervals,
and the Kolmogorov-Smirnov and Chi-square goodness-of-fit
tests, you can compute quantile and probability plots for
all distributions. These scatterplots are constructed so that
if the observed values follow the respective distribution,
then the points will form a straight line in the plot. These
plots are described further below.

Fitting Distributions by Moments

In addition to the specific continuous distributions
described above, you can fit general “families” of
distributions (the so-called Johnson and Pearson curves)
with the goal to match the first four moments of the observed
distribution.

General approach: The shapes of most continuous
distributions can be sufficiently summarized in the first
four moments. Put another way, if one fits to a histogram of
observed data a distribution that has the same mean (first
moment), variance (second moment), skewness (third
moment) and kurtosis (fourth moment) as the observed data,
then one can usually approximate the overall shape of the
distribution very well. Once a distribution has been fitted,
one can then calculate the expected percentile values under
the (standardized) fitted curve, and estimate the proportion
of items produced by the process that fall within the
specification limits.
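
The four moments named above can be computed from raw data in a few lines. The sketch below uses the simple moment (population) versions of skewness and kurtosis; this choice, the function name, and the sample values are assumptions for illustration, and software packages often apply small-sample corrections instead.

```python
def first_four_moments(data):
    """Return mean, variance, skewness, and kurtosis (moment versions)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n   # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2                       # equals 3 for a normal curve
    return mean, m2, skewness, kurtosis


sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
print(first_four_moments(sample))
```

A Johnson or Pearson curve would then be chosen so that its own four moments match these values.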

Johnson curves: Johnson (1949) described a system of
frequency curves that represents transformations of the
standard normal curve (see Hahn and Shapiro, 1967, for
details). By applying these transformations to a standard
normal variable, a wide variety of non-normal distributions
can be approximated, including distributions that are
bounded on either one or both sides (e.g., U-shaped
distributions). The advantage of this approach is that once
a particular Johnson curve has been fit, the normal integral
can be used to compute the expected percentage points under
the respective curve. Methods for fitting Johnson curves, so
as to approximate the first four moments of an empirical
distribution, are described in detail in Hahn and Shapiro,
1967, pages 199-220; and Hill, Hill, and Holder, 1976.


Pearson curves: Another system of distributions was 
proposed by Karl Pearson (e.g., see Hahn and Shapiro, 1967, 
pages 220-224). The system consists of seven solutions (of 
12 originally enumerated by Pearson) to a differential 
equation, which also approximate a wide range of 
distributions of different shapes. Gruska, Mirkhani, and 
Lamberson (1989) describe in detail how the different 
Pearson curves can be fit to an empirical distribution. A 
method for computing specific Pearson percentiles is also 
described in Davis and Stephens (1983). 


Assessing the Fit: Quantile and Probability Plots 

For each distribution, you can compute the table of 
expected and observed frequencies and the respective Chi- 
square goodness-of-fit test, as well as the Kolmogorov-Smirnov
d test. However, the best way to assess the quality
of the fit of a theoretical distribution to an observed 
distribution is to review the plot of the observed distribution 
against the theoretical fitted distribution. There are two 
standard types of plots used for this purpose: Quantile- 
quantile plots and probability-probability plots. 

Quantile-quantile plots: In quantile-quantile plots (or 
Q-Q plots for short), the observed values of a variable are 
plotted against the theoretical quantiles. To produce a Q-Q 
plot, you first sort the n observed data points into ascending 
order, so that: 

x(1) ≤ x(2) ≤ ... ≤ x(n)

These observed values are plotted against one axis of
the graph; on the other axis the plot will show:

F^-1((i-r_adj)/(n+n_adj))

where i is the rank of the respective observation, r_adj and
n_adj are adjustment factors (≤ 0.5) and F^-1 denotes the
inverse of the probability integral for the respective
standardized distribution. The resulting plot (see example
below) is a scatterplot of the observed values against the
(standardized) expected values, given the respective
distribution. Note that, in addition to the inverse probability
integral value, you can also show the respective cumulative
probability values on the opposite axis, that is, the plot will
show not only the standardized values for the theoretical
distribution, but also the respective p-values.



This plot would indicate a good fit of the theoretical
distribution to the observed values if the plotted values fall
onto a straight line. Note that the adjustment factors r_adj
and n_adj ensure that the p-value for the inverse probability
integral will fall between 0 and 1, but not including 0 and 1.
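
The construction of the Q-Q plot coordinates can be sketched as follows for the standard normal case, pairing each sorted observation with the theoretical quantile F^-1((i-r_adj)/(n+n_adj)). The default adjustment factors 0.375 and 0.25 are one common choice and are an assumption here, as are the function name and the sample data.

```python
from statistics import NormalDist


def qq_points(data, r_adj=0.375, n_adj=0.25):
    """Pair each sorted observation with its standard normal quantile.

    r_adj and n_adj are the adjustment factors; the defaults are an assumed
    common choice, not values prescribed by the text.
    """
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - r_adj) / (n + n_adj)  # strictly between 0 and 1
        points.append((std_normal.inv_cdf(p), x))
    return points


for q, x in qq_points([5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2]):
    print(round(q, 3), x)
```

Plotting the second column against the first gives the Q-Q plot; points close to a straight line suggest that the normal distribution fits the data.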


Probability-probability plots: In probability-probability
plots (or P-P plots for short) the observed
cumulative distribution function is plotted against the
theoretical cumulative distribution function. As in the Q-Q
plot, the values of the respective variable are first sorted
into ascending order. The i’th observation is plotted against
one axis as i/n (i.e., the observed cumulative distribution
function), and against the other axis as F(x(i)), where F(x(i))
stands for the value of the theoretical cumulative distribution
function for the respective observation x(i). If the theoretical
cumulative distribution approximates the observed
distribution well, then all points in this plot should fall
onto the diagonal line (as in the graph below).

Non-Normal Process Capability Indices (Percentile Method)


As described earlier, process capability indices are 
generally computed to evaluate the quality of a process, 
that is, to estimate the relative range of the items 
manufactured by the process (process width) with regard to 
the engineering specifications. For the standard, normal- 
distribution-based, process capability indices, the process 
width is typically defined as 6 times sigma, that is, as
plus/minus 3 times the estimated process standard deviation.
For the standard normal curve, these limits (zL = -3 and zU
= +3) translate into the 0.135 percentile and 99.865
percentile, respectively. In the non-normal case, the 3 times
sigma limits as well as the mean (zM = 0.0) can be replaced
by the corresponding standard values, given the same 
percentiles, under the non-normal curve. This procedure is 
described in detail by Clements (1989). 











Process capability indices: Shown below are the 
formulas for the non-normal process capability indices: 

Cp = (USL-LSL)/(Up-Lp)

CpL = (M-LSL)/(M-Lp)

CpU = (USL-M)/(Up-M)

Cpk = Min(CpU, CpL)

In these equations, M represents the 50’th percentile
value for the respective fitted distribution, and Up and Lp
are the 99.865 and .135 percentile values, respectively, if
the computations are based on a process width of ±3 times
sigma. Note that the values for Up and Lp may be different,
if the process width is defined by different sigma limits
(e.g., ±2 times sigma).
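
The percentile method can be sketched for any fitted distribution whose quantile (inverse cumulative distribution) function is available. The example below uses a Weibull quantile function purely for illustration; the function names, the fitted parameters, and the specification limits are hypothetical.

```python
import math


def weibull_quantile(p, b, c, theta=0.0):
    """Inverse of F(x) = 1 - exp(-((x - theta)/b)**c)."""
    return theta + b * (-math.log(1.0 - p)) ** (1.0 / c)


def percentile_capability(usl, lsl, quantile):
    """Non-normal capability indices from a fitted quantile function."""
    lp = quantile(0.00135)  # 0.135 percentile
    m = quantile(0.5)       # 50th percentile (median)
    up = quantile(0.99865)  # 99.865 percentile
    cp = (usl - lsl) / (up - lp)
    cpl = (m - lsl) / (m - lp)
    cpu = (usl - m) / (up - m)
    return {"Cp": cp, "CpL": cpl, "CpU": cpu, "Cpk": min(cpu, cpl)}


# Hypothetical fitted Weibull (scale 10, shape 2) and specification limits
indices = percentile_capability(
    usl=30.0, lsl=0.1, quantile=lambda p: weibull_quantile(p, 10.0, 2.0)
)
print(indices)
```

Because the fitted distribution is skewed, CpL and CpU differ even though neither specification limit is violated more than the other would be under a symmetric curve.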

WEIBULL AND RELIABILITY/FAILURE TIME 
ANALYSIS 

A key aspect of product quality is product reliability. A 
number of specialized techniques have been developed to 
quantify reliability and to estimate the “life expectancy” of 
a product. 

General Purpose 

The reliability of a product or component constitutes an 
important aspect of product quality. Of particular interest 
is the quantification of a product’s reliability, so that one 
can derive estimates of the product’s expected useful life. 
For example, suppose you are flying a small single engine 
aircraft. It would be very useful (in fact vital) information 
to know what the probability of engine failure is at different 
stages of the engine’s “life” (e.g., after 500 hours of operation, 
1000 hours of operation, etc.). Given a good estimate of the 
engine’s reliability, and the confidence limits of this estimate, 
one can then make a rational decision about when to swap 
or overhaul the engine. 






The Weibull Distribution 


A useful general distribution for describing failure time 
data is the Weibull distribution. The distribution is named 
after the Swedish professor Waloddi Weibull, who 
demonstrated the appropriateness of this distribution for 
modeling a wide variety of different data sets (see also Hahn 
and Shapiro, 1967; for example, the Weibull distribution 
has been used to model the life times of electronic 
components, relays, ball bearings, or even some businesses). 

Hazard function and the bathtub curve: It is often 
meaningful to consider the function that describes the 
probability of failure during a very small time increment 
(assuming that no failures have occurred prior to that time). 
This function is called the hazard function (or, sometimes, 
also the conditional failure, intensity, or force of mortality 
function), and is generally defined as: 


h(t) = f(t)/(1-F(t))


where h(t) stands for the hazard function (of time t ), and 
f(t) and F(t) are the probability density and cumulative 
distribution functions, respectively. The hazard (conditional 
failure) function for most machines (components, devices) 
can best be described in terms of the “bathtub” curve: Very 
early during the life of a machine, the rate of failure is 
relatively high (so-called Infant Mortality Failures); after 
all components settle, and the electronic parts are burned 
in, the failure rate is relatively constant and low. Then, 
after some time of operation, the failure rate again begins 
to increase (so-called Wear-out Failures), until all 
components or devices will have failed. 

For example, new automobiles often suffer several small 
failures right after they were purchased. Once these have 
been “ironed out,” a (hopefully) long relatively trouble-free 
period of operation will follow. Then, as the car reaches a 
particular age, it becomes more prone to breakdowns, until 
finally, after 20 years and 250,000 miles, practically all cars
will have failed. A typical bathtub hazard function is shown
below.



The Weibull distribution is flexible enough for modeling 
the key stages of this typical bathtub-shaped hazard 
function. Shown below are the hazard functions for shape 
parameters c=.5, c=1, c=2, and c=5.

Clearly, the early (“infant mortality”) “phase” of the
bathtub can be approximated by a Weibull hazard function
with shape parameter c<1; the constant hazard phase of
the bathtub can be modeled with a shape parameter c=1,
and the final (“wear-out”) stage of the bathtub with c>1.
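
The three regimes can be checked numerically with the two-parameter Weibull hazard h(t) = (c/b)*(t/b)^(c-1). This is a small sketch; the function name and the evaluation times are illustrative assumptions, and the scale is set to b=1 for convenience.

```python
def weibull_hazard(t, b, c):
    """Hazard h(t) = (c/b) * (t/b)**(c - 1) for the two-parameter Weibull."""
    return (c / b) * (t / b) ** (c - 1.0)


times = [0.5, 1.0, 2.0]
for c in (0.5, 1.0, 2.0, 5.0):
    # c < 1: hazard falls over time; c = 1: constant; c > 1: hazard rises
    print(c, [round(weibull_hazard(t, 1.0, c), 3) for t in times])
```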

Cumulative distribution and reliability functions:

Once a Weibull distribution (with a particular set of 
parameters) has been fit to the data, a number of additional 
important indices and measures can be estimated. For 
example, you can compute the cumulative distribution 
function (commonly denoted as F(t)) for the fitted
distribution, along with the standard errors for this function. 
Thus, you can determine the percentiles of the cumulative 
survival (and failure) distribution, and, for example, predict 
the time at which a predetermined percentage of components 
can be expected to have failed. 


The reliability function (commonly denoted as R(t)) is 
the complement to the cumulative distribution function (i.e., 
R(t)=1-F(t)); the reliability function is also sometimes
referred to as the survivorship or survival function (since it
describes the probability of not failing or of surviving until
a certain time t; e.g., see Lee, 1992). Shown below is the
reliability function for the Weibull distribution, for different 
shape parameters. 



For shape parameters less than 1, the reliability
decreases sharply very early in the respective product’s life,
and then slowly thereafter. For shape parameters greater
than 1, the initial drop in reliability is small, and then the
reliability drops relatively sharply at some point later in
time. The point where all curves intersect is called the
characteristic life: regardless of the shape parameter, 63.2
percent of the population will have failed at or before this
point (i.e., R(t) = 1-0.632 = .368). This point in time is also
equal to the respective scale parameter b of the two-parameter
Weibull distribution (with θ = 0; otherwise it is
equal to b+θ).
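
The characteristic life property is easy to verify numerically: at t = b+θ the Weibull reliability is exp(-1), about .368, whatever the shape parameter. The sketch below assumes a hypothetical scale of 1000 hours; the function name is illustrative.

```python
import math


def weibull_reliability(t, b, c, theta=0.0):
    """R(t) = exp(-((t - theta)/b)**c), the probability of surviving past t."""
    return math.exp(-(((t - theta) / b) ** c))


b = 1000.0  # hypothetical scale parameter, in hours
for c in (0.5, 1.0, 2.0, 5.0):
    # Every shape gives the same reliability at the characteristic life
    print(c, round(weibull_reliability(b, b, c), 4))
```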

Censored Observations 

In most studies of product reliability, not all items in
the study will fail. In other words, by the end of the study
the researcher only knows that a certain number of items
have not failed for a particular amount of time, but has no
knowledge of the exact failure times (i.e., “when the items
would have failed”). Those types of data are called censored
observations. The issue of censoring, and several methods
for analyzing censored data sets, are also described in great
detail in the context of Survival Analysis. Censoring can
occur in many different ways.

Type I and II censoring: So-called Type I censoring 
describes the situation when a test is terminated at a 
particular point in time, so that the remaining items are 
only known not to have failed up to that time (e.g., we start 
with 100 light bulbs, and terminate the experiment after a 
certain amount of time). In this case, the censoring time is 
often fixed, and the number of items failing is a random 
variable. In Type II censoring the experiment would be 
continued until a fixed proportion of items have failed (e.g.,
we stop the experiment after exactly 50 light bulbs have 
failed). In this case, the number of items failing is fixed, 
and time is the random variable. 

Left and right censoring: An additional distinction
can be made to reflect the “side” of the time dimension at
which censoring occurs. In the examples described above,
the censoring always occurred on the right side (right
censoring), because the researcher knows when exactly the
experiment started, and the censoring always occurs on the
right side of the time continuum. Alternatively, it is
conceivable that the censoring occurs on the left side (left
censoring). For example, in biomedical research one may
know that a patient entered the hospital at a particular
date, and that s/he survived for a certain amount of time
thereafter; however, the researcher does not know when
exactly the symptoms of the disease first occurred or were
diagnosed.


Single and multiple censoring: Finally, there are 
situations in which censoring can occur at different times 
(multiple censoring), or only at a particular point in time
(single censoring). To return to the light bulb example, if 
the experiment is terminated at a particular point in time, 
then a single point of censoring exists, and the data set is 
said to be single-censored. However, in biomedical research 
multiple censoring often exists, for example, when patients 
are discharged from a hospital after different amounts 
(times) of treatment, and the researcher knows that the 
patient survived up to those (differential) points of censoring. 

The methods described in this section are applicable 
primarily to right censoring, and single- as well as multiple- 
censored data. 


Two- and three-parameter Weibull distribution 

The Weibull distribution is bounded on the left side. If
you look at the probability density function, you can see
that the term x-θ must be greater than 0. In most
cases, the location parameter θ (theta) is known (usually 0):
it identifies the smallest possible failure time. However,
sometimes the probability of failure of an item is 0 (zero)
for some time after a study begins, and in that case it may
be necessary to estimate a location parameter that is greater
than 0. There are several methods for estimating the location
parameter of the three-parameter Weibull distribution. To
identify situations when the location parameter is greater
than 0, Dodson (1994) recommends looking for downward or
upward sloping tails on a probability plot (see below), as
well as large (>6) values for the shape parameter after
fitting the two-parameter Weibull distribution, which may
indicate a non-zero location parameter.

PARAMETER ESTIMATION 

Maximum likelihood estimation: Standard iterative 
function minimization methods can be used to compute 
maximum likelihood parameter estimates for the two- and
three-parameter Weibull distribution. The specific methods
for estimating the parameters are described in Dodson 
(1994); a detailed description of a Newton-Raphson iterative 
method for estimating the maximum likelihood parameters 
for the two-parameter distribution is provided in Keats and 
Lawrence (1997). 
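
A Newton-Raphson iteration for the two-parameter case can be sketched as follows. This is not a transcription of the procedure in the cited references; it solves the standard profile likelihood equation for the shape parameter c with right-censored data, and then obtains the scale b in closed form. The function name, the starting value, the positivity safeguard, and the sample data are assumptions.

```python
import math


def weibull_mle(times, failed, c0=1.0, tol=1e-10, max_iter=100):
    """Two-parameter Weibull MLE (shape c, scale b) for right-censored data.

    times  : observed failure or censoring times, all greater than 0
    failed : True where the time is an observed failure, False if censored
    """
    logs = [math.log(t) for t in times]
    r = sum(failed)  # number of observed failures
    mean_log_fail = sum(lg for lg, f in zip(logs, failed) if f) / r

    def score(c):
        s0 = sum(t ** c for t in times)
        s1 = sum(t ** c * lg for t, lg in zip(times, logs))
        s2 = sum(t ** c * lg * lg for t, lg in zip(times, logs))
        g = s1 / s0 - 1.0 / c - mean_log_fail          # profile score for c
        dg = s2 / s0 - (s1 / s0) ** 2 + 1.0 / c ** 2   # derivative, always > 0
        return g, dg

    c = c0
    for _ in range(max_iter):
        g, dg = score(c)
        c_new = c - g / dg            # Newton-Raphson step
        if c_new <= 0:                # safeguard: keep the shape positive
            c_new = c / 2.0
        converged = abs(c_new - c) < tol
        c = c_new
        if converged:
            break
    b = (sum(t ** c for t in times) / r) ** (1.0 / c)
    return c, b


# Hypothetical complete sample of six failure times (hours)
shape, scale = weibull_mle([16, 34, 53, 75, 93, 120], [True] * 6)
print(shape, scale)
```

Censored items enter the sums over all observations but not the average of the log failure times, which is how the likelihood accounts for items that survived the test.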

The estimation of the location parameter for the three- 
parameter Weibull distribution poses a number of special 
problems, which are detailed in Lawless (1982). Specifically, 
when the shape parameter is less than 1, then a maximum 
likelihood solution does not exist for the parameters. In 
other instances, the likelihood function may contain more 
than one maximum (i.e., multiple local maxima). In the 
latter case, Lawless basically recommends using the smallest 
failure time (or a value that is a little bit less) as the estimate 
of the location parameter. 

Nonparametric (rank-based) probability plots: One 
can derive a descriptive estimate of the cumulative 
distribution function (regardless of distribution) by first 
rank-ordering the observations, and then computing any of 
the following expressions: 

Median rank: 

F(t) = (j-0.3)/(n+0.4) 






Mean rank: 

F(t) = j/(n+1)

White’s plotting position: 

F(t) = (j-3/8)/(n+1/4)

where j denotes the failure order (rank; for multiple-censored 
data a weighted average ordered failure is computed; see 
Dodson, p. 21), and n is the total number of observations.
One can then construct the following plot.



Note that the horizontal Time axis is scaled logarithmically;
on the vertical axis the quantity log(log(100/(100-F(t))))
is plotted (a probability scale is shown on the left-y
axis). From this plot the parameters of the two-parameter
Weibull distribution can be estimated; specifically, the shape
parameter is equal to the slope of the linear fit-line, and
the scale parameter can be estimated as exp(-intercept/slope).
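
The probability plotting estimates can be sketched with an ordinary least squares fit. The code below uses median ranks on complete (uncensored) data and works with F(t) as a proportion rather than a percentage, so the vertical quantity is log(log(1/(1-F(t)))). The function name and the sample data are assumptions.

```python
import math


def weibull_probability_plot_fit(failure_times):
    """Estimate (shape, scale) by least squares on the Weibull probability plot.

    Uses median ranks F = (j - 0.3)/(n + 0.4) on complete (uncensored) data.
    """
    ts = sorted(failure_times)
    n = len(ts)
    xs = [math.log(t) for t in ts]
    ys = [math.log(-math.log(1.0 - (j - 0.3) / (n + 0.4)))
          for j in range(1, n + 1)]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, math.exp(-intercept / slope)  # shape, scale


# Hypothetical complete sample of six failure times (hours)
shape, scale = weibull_probability_plot_fit([16, 34, 53, 75, 93, 120])
print(round(shape, 2), round(scale, 1))
```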

Estimating the location parameter from probability plots.
It is apparent in the plot shown above that the regression
line provides a good fit to the data. When the location
parameter is misspecified (e.g., not equal to zero), then the
linear fit is worse as compared to the case when it is
appropriately specified. Therefore, one can compute the
probability plot for several values of the location parameter,
and observe the quality of the fit. These computations are
summarized in the following plot.

Here the common R-square measure (correlation
squared) is used to express the quality of the linear fit in
the probability plot, for different values of the location
parameter shown on the horizontal x axis (this plot is based
on the example data set in Dodson, 1994, Table 2.9). This
plot is often very useful when the maximum likelihood
estimation procedure for the three-parameter Weibull
distribution fails, because it shows whether or not a unique
(single) optimum value for the location parameter exists (as
in the plot shown above).
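
The scan over candidate location values can be sketched as below. The data are synthetic (exact Weibull quantiles with shape 2 and scale 100, shifted by 50) and are not the Dodson example; the function names and the candidate grid are likewise assumptions.

```python
import math


def plot_r_squared(failure_times, theta):
    """R-square of the Weibull probability plot after subtracting theta."""
    ts = sorted(failure_times)
    n = len(ts)
    xs = [math.log(t - theta) for t in ts]  # requires theta < min(ts)
    ys = [math.log(-math.log(1.0 - (j - 0.3) / (n + 0.4)))
          for j in range(1, n + 1)]
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def scan_location(failure_times, candidates):
    """Return (best_theta, r_squared) over the candidate location values."""
    return max(((th, plot_r_squared(failure_times, th)) for th in candidates),
               key=lambda pair: pair[1])


# Synthetic data: exact Weibull quantiles (shape 2, scale 100) shifted by 50
n = 8
data = [50.0 + 100.0 * (-math.log(1.0 - (j - 0.3) / (n + 0.4))) ** 0.5
        for j in range(1, n + 1)]
print(scan_location(data, [0, 10, 20, 30, 40, 50]))
```

With real data the R-square curve is rarely this clean, but a single clear peak still suggests a usable estimate of the location parameter.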


Hazard plotting. Another method for estimating the 
parameters for the two-parameter Weibull distribution is 
via hazard plotting (as discussed earlier, the hazard function 
describes the probability of failure during a very small time 
increment, assuming that no failures have occurred prior to 
that time). This method is very similar to the probability 
plotting method. First plot the logarithm of the cumulative
hazard function against the logarithm of the survival times;
then fit a linear
regression line and compute the slope and intercept of that 
line. As in probability plotting, the shape parameter can 
then be estimated as the slope of the regression line, and 
the scale parameter as exp(-intercept / slope). 
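
Hazard plotting can be sketched in the same way as probability plotting. The cumulative hazard estimate used here (a running sum of 1/k over the failures, where k is the number of items still on test) is an assumption of this sketch, as are the function name and the sample data. The regression is of the log cumulative hazard on the log failure time, which is what makes the slope equal to the shape parameter.

```python
import math


def weibull_hazard_plot_fit(times, failed):
    """Estimate (shape, scale) from a log-log cumulative hazard plot.

    Censored items add nothing to the cumulative hazard but still reduce
    the number of items at risk for later failures.
    """
    # Sort by time; on ties, failures are placed before censored items
    order = sorted(zip(times, failed), key=lambda p: (p[0], not p[1]))
    n = len(order)
    xs, ys, cum_hazard = [], [], 0.0
    for i, (t, is_failure) in enumerate(order):
        if is_failure:
            cum_hazard += 1.0 / (n - i)  # n - i items remain at risk
            xs.append(math.log(t))
            ys.append(math.log(cum_hazard))
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return slope, math.exp(-intercept / slope)  # shape, scale


# Hypothetical complete sample of six failure times (hours)
shape, scale = weibull_hazard_plot_fit([16, 34, 53, 75, 93, 120], [True] * 6)
print(round(shape, 2), round(scale, 1))
```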



Method of moments: This method (approximating
the moments of the observed distribution by choosing the
appropriate parameters for the Weibull distribution) is also
widely described in the literature. In fact, this general
method is used for fitting the Johnson curves general
non-normal distribution to the data, to compute non-normal
process capability indices. However, the method is not suited
for censored data sets, and is therefore not very useful for
the analysis of failure time data.

Comparing the estimation methods. Dodson (1994) 
reports the result of a Monte Carlo simulation study, 
comparing the different methods of estimation. In general, 
the maximum likelihood estimates proved to be best for 
large sample sizes (e.g., n>15), while probability plotting
and hazard plotting appeared to produce better (more 
accurate) estimates for smaller samples. 

A note of caution regarding maximum likelihood based 
confidence limits. Many software programs will compute 
confidence intervals for maximum likelihood estimates, and 
for the reliability function based on the standard errors of 
the maximum likelihood estimates. Dodson (1994) cautions 
against the interpretation of confidence limits computed from 
maximum likelihood estimates, or more precisely, estimates 
that involve the information matrix for the estimated 
parameters. When the shape parameter is less than 2, the 
variance estimates computed for maximum likelihood 
estimates lack accuracy, and it is advisable to compute 
the various results graphs based on nonparametric 
confidence limits as well. 

Goodness of Fit Indices 

A number of different tests have been proposed for 
evaluating the quality of the fit of the Weibull distribution 
to the observed data. These tests are discussed and compared 
in detail in Lawless (1982). 

Hollander-Proschan: This test compares the
theoretical reliability function to the Kaplan-Meier estimate. 
The actual computations for this test are somewhat complex, 
and you may refer to Dodson for a detailed description of 
the computational formulas. The Hollander-Proschan test 
is applicable to complete, single-censored, and multiple- 
censored data sets; however, Dodson (1994) cautions that 
the test may sometimes indicate a poor fit when the data 
are heavily single-censored. The Hollander-Proschan C 
statistic can be tested against the normal distribution (z).

Mann-Scheuer-Fertig: This test, proposed by Mann, 
Scheuer, and Fertig (1973), is described in detail in, for 
example, Dodson (1994) or Lawless (1982). The null 
hypothesis for this test is that the population follows the 
Weibull distribution with the estimated parameters. Nelson 
(1982) reports this test to have reasonably good power, and 
this test can be applied to Type II censored data. For 
computational details refer to Dodson (1994) or Lawless 
(1982); the critical values for the test statistic have been 
computed based on Monte Carlo studies, and have been 
tabulated for n (sample sizes) between 3 and 25. 

Anderson-Darling: The Anderson-Darling procedure 
is a general test to compare the fit of an observed cumulative 
distribution function to an expected cumulative distribution 
function. However, this test is only applicable to complete
data sets. The critical values for the Anderson-Darling
statistic have been tabulated (see, for example, Dodson, 1994,
Table 4.4) for sample sizes between 10 and 40; this test is
not computed for n less than 10 or greater than 40.

Interpreting Results 

Once a satisfactory fit of the Weibull distribution to the 
observed failure time data has been obtained, there are a 
number of different plots and tables that are of interest to 
understand the reliability of the item under investigation. 
If a good fit for the Weibull cannot be established,
distribution-free reliability estimates (and graphs) should
be reviewed to determine the shape of the reliability function.

Reliability plots: This plot will show the estimated 
reliability function along with the confidence limits. 



Note that nonparametric (distribution-free) estimates 
and their standard errors can also be computed and plotted. 

Hazard plots: As mentioned earlier, the hazard function 
describes the probability of failure during a very small time 
increment (assuming that no failures have occurred prior to 
that time). The plot of hazard as a function of time gives 
valuable information about the conditional failure 
probability. 

Percentiles of the reliability function: Based on the 
fitted Weibull distribution, one can compute the percentiles 
of the reliability (survival) function, along with the 
confidence limits for these estimates (for maximum likelihood 
parameter estimates). These estimates are particularly 
valuable for determining the percentages of items that can 
be expected to have failed at particular points in time. 

Grouped Data

In some cases, failure time data are presented in grouped 
form. Specifically, instead of having available the precise 
failure time for each observation, only aggregate information 
is available about the number of items that failed or were 
censored in a particular time interval. Such life-table data 
input is also described in the context of the Survival Analysis 
chapter. There are two general approaches for fitting the 
Weibull distribution to grouped data. 

First, one can treat the tabulated data as if they were 
continuous. In other words, one can “expand” the tabulated 
values into continuous data by assuming (1) that each 
observation in a given time interval failed exactly at the 
interval mid-point (interpolating out “half a step” for the 
last interval), and (2) that censoring occurred after the 
failures in each interval (in other words, censored 
observations are sorted after the observed failures). Lawless 
(1982) advises that this method is usually satisfactory if 
the class intervals are relatively narrow. 

Alternatively, you may treat the data explicitly as a 
tabulated life table, and use a weighted least squares
algorithm (based on Gehan and Siddiqui, 1973;
see also Lee, 1992) to fit the Weibull distribution (Lawless, 
1982, also describes methods for computing maximum 
likelihood parameter estimates from grouped data). 

Modified Failure Order for Multiple-Censored Data 

For multiple-censored data a weighted average ordered 
failure is calculated for each failure after the first censored 
data point. These failure orders are then used to compute 
the median rank, to estimate the cumulative distribution 
function. 

The modified failure order Oj is computed from the
increment Ij (see Dodson 1994):

Ij = ((n+1)-Op)/(1+c)

where:

Ij is the increment for the j’th failure

n is the total number of data points

Op is the failure order of the previous observation (and
Oj = Op + Ij)

c is the number of data points remaining in the data
set, including the current data point

The median rank is then computed as:

F(t) = (Oj-0.3)/(n+0.4)

where Oj denotes the modified failure order, and n is the
total number of observations.
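
The modified failure orders and their median ranks can be computed in a single pass over the sorted data. In this sketch the increment formula is applied to every failure (before the first censored point it simply yields an increment of 1); the function name and the five-item data set are hypothetical.

```python
def modified_median_ranks(times, failed):
    """Median ranks for multiple-censored data using modified failure orders.

    Returns a list of (time, failure_order, median_rank), one per failure.
    """
    # Sort by time; on ties, failures are placed before censored items
    order = sorted(zip(times, failed), key=lambda p: (p[0], not p[1]))
    n = len(order)
    previous_order = 0.0
    results = []
    for i, (t, is_failure) in enumerate(order):
        if not is_failure:
            continue
        remaining = n - i  # data points left, including the current one
        increment = ((n + 1) - previous_order) / (1 + remaining)
        previous_order += increment
        median_rank = (previous_order - 0.3) / (n + 0.4)
        results.append((t, previous_order, median_rank))
    return results


# Hypothetical data: failures at 10, 30, 40; items censored at 20 and 50
for row in modified_median_ranks([10, 20, 30, 40, 50],
                                 [True, False, True, True, False]):
    print(row)
```

The censored item at time 20 pushes the order of the next failure from 2 up to 2.25, reflecting the chance that it would have failed before the later items.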


Weibull CDF, Reliability, and Hazard 

Density function. The Weibull distribution (Weibull,
1939, 1951; see also Lieblein, 1955) has density function
(for positive parameters b, c, and θ):


f(x) = c/b * [(x-θ)/b]^(c-1) * e^{-[(x-θ)/b]^c}

θ < x, b > 0, c > 0

where 

b is the scale parameter of the distribution 

c is the shape parameter of the distribution 

θ is the location parameter of the distribution

e is the base of the natural logarithm, sometimes 
called Euler’s e (2.71...) 


Cumulative distribution function (CDF): The
Weibull distribution has the cumulative distribution function
(for positive parameters b, c, and θ):


F(x) = 1 - exp{-[(x-θ)/b]^c}

using the same notation and symbols as described above for
the density function.

Reliability function. The Weibull reliability function is
the complement of the cumulative distribution function: 

R(x) = 1 - F(x) 

Hazard function. The hazard function describes the 
probability of failure during a very small time increment, 
assuming that no failures have occurred prior to that time. 
The Weibull distribution has the hazard function (for positive
parameters b, c, and θ):

h(x) = f(x)/R(x) = [c*(x-θ)^(c-1)] / b^c

using the same notation and symbols as described above for 
the density and reliability functions. 

Cumulative hazard function: The Weibull distribution
has the cumulative hazard function (for positive parameters
b, c, and θ):

H(x) = (x-θ)^c / b^c

using the same notation and symbols as described above for
the density and reliability functions.
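
The five functions above can be written out directly and checked against each other: the hazard should equal f(x)/R(x), and the cumulative hazard should equal -log(R(x)). The function names and the parameter values in this sketch are illustrative assumptions.

```python
import math


def weibull_pdf(x, b, c, theta=0.0):
    z = (x - theta) / b
    return (c / b) * z ** (c - 1.0) * math.exp(-(z ** c))


def weibull_cdf(x, b, c, theta=0.0):
    return 1.0 - math.exp(-(((x - theta) / b) ** c))


def weibull_reliability(x, b, c, theta=0.0):
    return 1.0 - weibull_cdf(x, b, c, theta)


def weibull_hazard(x, b, c, theta=0.0):
    return c * (x - theta) ** (c - 1.0) / b ** c


def weibull_cumulative_hazard(x, b, c, theta=0.0):
    return ((x - theta) / b) ** c


x, b, c, theta = 150.0, 100.0, 2.0, 20.0  # hypothetical values
# The two lines below should print the same hazard value
print(weibull_hazard(x, b, c, theta))
print(weibull_pdf(x, b, c, theta) / weibull_reliability(x, b, c, theta))
# The cumulative hazard should match -log of the reliability
print(weibull_cumulative_hazard(x, b, c, theta),
      -math.log(weibull_reliability(x, b, c, theta)))
```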




Survival/Failure Time Analysis 


These techniques were primarily developed in the medical 
and biological sciences, but they are also widely used in the 
social and economic sciences, as well as in engineering
(reliability and failure time analysis). Imagine that you are 
a researcher in a hospital who is studying the effectiveness 
of a new treatment for a generally terminal disease. The
major variable of interest is the number of days that the 
respective patients survive. In principle, one could use the 
standard parametric and nonparametric statistics for 
describing the average survival, and for comparing the new 
treatment with traditional methods. However, at the end of
the study there will be patients who survived over the entire 
study period, in particular among those patients who entered 
the hospital (and the research project) late in the study; 
there will be other patients with whom we will have lost 
contact. Surely, one would not want to exclude all of those 
patients from the study by declaring them to be missing 
data (since most of them are “survivors” and, therefore, 
they reflect on the success of the new treatment method). 
Those observations, which contain only partial information, 
are called censored observations.

Censored Observations 

In general, censored observations arise whenever the 
dependent variable of interest represents the time to a 
terminal event, and the duration of the study is limited in
time. Censored observations may occur in a number of 
different areas of research. For example, in the social 
sciences we may study the “survival” of marriages, high 
school dropout rates (time to drop-out), turnover in 
organizations, etc. In each case, by the end of the study period, 
some subjects will still be married, will not have dropped 
out, or are still working at the same company; thus, those 
subjects represent censored observations. In economics we 
may study the “survival” of new businesses or the “survival” 
times of products such as automobiles. In quality control 
research, it is common practice to study the “survival” of 
parts under stress (failure time analysis). 

Analytic Techniques 

Essentially, the methods offered in Survival Analysis 
address the same research questions as many of the other 
procedures; however, all methods in Survival Analysis will 
handle censored data. The life table, survival distribution, 
and Kaplan-Meier survival function estimation are all 
descriptive methods for estimating the distribution of 
survival times from a sample. Several techniques are 
available for comparing the survival in two or more groups. 
Finally, Survival Analysis offers several regression models 
for estimating the relationship of (multiple) continuous 
variables to survival times. 

Life Table Analysis 

The most straightforward way to describe the survival 
in a sample is to compute the Life Table. The life table 
technique is one of the oldest methods for analyzing survival 
(failure time) data. This table can be thought of as an 
“enhanced” frequency distribution table. The distribution of 
survival times is divided into a certain number of intervals. 
For each interval we can then compute the number and 
proportion of cases or objects that entered the respective 
interval “alive,” the number and proportion of cases that 
failed in the respective interval (i.e., number of terminal 
events, or number of cases that “died”), and the number of 
cases that were lost or censored in the respective interval. 

Based on those numbers and proportions, several 
additional statistics can be computed: 

Number of Cases at Risk: This is the number of cases 
that entered the respective interval alive, minus half of the 
number of cases lost or censored in the respective interval. 

Proportion Failing: This proportion is computed as 
the number of cases failing in the respective interval, 
divided by the number of cases at risk in the interval. 

Proportion Surviving: This proportion is computed 
as 1 minus the proportion failing. 

Cumulative Proportion Surviving (Survival 
Function): This is the cumulative proportion of cases 
surviving up to the respective interval. Since the probabilities 
of survival are assumed to be independent across the 
intervals, this probability is computed by multiplying out 
the probabilities of survival across all previous intervals. 
The resulting function is also called the survivorship or 
survival function. 

Probability Density: This is the estimated probability 
of failure in the respective interval, computed per unit of 
time, that is: 

F_i = (P_i - P_{i+1}) / h_i 

In this formula, F_i is the respective probability density 
in the i’th interval, P_i is the estimated cumulative proportion 
surviving at the beginning of the i’th interval (at the end of 
interval i-1), P_{i+1} is the cumulative proportion surviving at 
the end of the i’th interval, and h_i is the width of the 
respective interval. 

Hazard Rate: The hazard rate (the term was first used 
by Barlow, 1963) is defined as the probability per time unit 
that a case that has survived to the beginning of the 
respective interval will fail in that interval. Specifically, it 
is computed as the number of failures per time unit in the 
respective interval, divided by the average number of 
surviving cases at the mid-point of the interval. 

Median Survival Time: This is the survival time at 
which the cumulative survival function is equal to 0.5. Other 
percentiles (25th and 75th percentile) of the cumulative 
survival function can be computed accordingly. Note that 
the 50th percentile (median) for the cumulative survival 
function is usually not the same as the point in time up to 
which 50% of the sample survived. 


Required Sample Sizes: In order to arrive at reliable 
estimates of the three major functions (survival, probability 
density, and hazard) and their standard errors at each time 
interval, the minimum recommended sample size is 30. 
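
The life table statistics defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the dictionary keys, and the counts are hypothetical, and the hazard rate uses the common actuarial form (failures per time unit divided by the number at risk minus half the failures), which is an assumption about how the "average number of surviving cases" is obtained.

```python
def life_table(n_entering, failed, censored, widths):
    """Return one dict of life-table statistics per interval.

    n_entering: cases alive at the start of the first interval
    failed, censored, widths: per-interval counts and interval widths
    """
    rows = []
    cum_start = 1.0  # cumulative proportion surviving at interval start
    for d, c, h in zip(failed, censored, widths):
        at_risk = n_entering - c / 2.0       # number of cases at risk
        p_fail = d / at_risk                 # proportion failing
        p_surv = 1.0 - p_fail                # proportion surviving
        cum_end = cum_start * p_surv         # cumulative proportion surviving
        density = (cum_start - cum_end) / h  # probability density per time unit
        # Assumed actuarial form of the hazard rate (see lead-in).
        hazard = d / (h * (at_risk - d / 2.0))
        rows.append({
            "at_risk": at_risk,
            "p_fail": p_fail,
            "p_surv": p_surv,
            "cum_surv_end": cum_end,
            "density": density,
            "hazard": hazard,
        })
        cum_start = cum_end
        n_entering = n_entering - d - c      # cases entering the next interval
    return rows


# Hypothetical data: 100 cases, two intervals of width 1.
table = life_table(100, failed=[10, 20], censored=[0, 10], widths=[1.0, 1.0])
for row in table:
    print(row)
```

In the second interval, 90 cases enter alive and 10 are censored, so the number at risk is 85 rather than 90; this is the "minus half of the number censored" adjustment described above.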


DISTRIBUTION FITTING 

General Introduction: In summary, the life table gives 
us a good indication of the distribution of failures over time. 
However, for predictive purposes it is often desirable to 
understand the shape of the underlying survival function 
in the population. The major distributions that have been 
proposed for modeling survival or failure times are the 
exponential (and linear exponential) distribution, the Weibull 
distribution of extreme events, and the Gompertz 
distribution. 

Estimation: The parameter estimation procedure (for 
estimating the parameters of the theoretical survival 
functions) is essentially a least squares linear regression 
algorithm. A linear regression algorithm can be used because 
all four theoretical distributions can be “made linear” by 
appropriate transformations. Such transformations 
sometimes produce different variances for the residuals at 
different times, leading to biased estimates. 

Goodness-of-Fit: Given the parameters for the different 
distribution functions and the respective model, we can 
compute the likelihood of the data. One can also compute 
the likelihood of the data under the null model, that is, a 
model that allows for different hazard rates in each interval. 
Without going into details, these two likelihoods can be 
compared via an incremental Chi-square test statistic. If 
this Chi-square is statistically significant, then we conclude 
that the respective theoretical distribution fits the data 
significantly worse than the null model; that is, we reject 
the respective distribution as a model for our data. 
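
To illustrate how a theoretical survival function can be "made linear" for least squares estimation, the sketch below fits a Weibull survival function S(t) = exp{-(t/b)^c}. Taking logarithms twice gives log(-log S(t)) = c*log(t) - c*log(b), which is a straight line in log(t). The function name is hypothetical, the location parameter is assumed to be zero, and the fit is an ordinary (unweighted) least squares fit, so it does not correct for the unequal residual variances mentioned above.

```python
import math


def fit_weibull_least_squares(times, survival):
    """Estimate Weibull scale b and shape c from (time, survival) pairs.

    Assumes 0 < S(t) < 1 at every time point and a location parameter of 0.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(s)) for s in survival]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c = sxy / sxx                    # slope of the line is the shape c
    intercept = mean_y - c * mean_x  # intercept is -c * log(b)
    b = math.exp(-intercept / c)
    return b, c


# Hypothetical survival proportions generated from b = 2, c = 1.5.
times = [1.0, 2.0, 3.0, 4.0]
survival = [math.exp(-((t / 2.0) ** 1.5)) for t in times]
b_hat, c_hat = fit_weibull_least_squares(times, survival)
print(b_hat, c_hat)
```

Because the example data lie exactly on a Weibull curve, the fit recovers the generating parameters up to rounding error; with observed life table proportions the points scatter around the line.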


Plots: You can produce plots of the survival function, 
hazard, and probability density for the observed data and 
the respective theoretical distributions. These plots provide 
a quick visual check of the goodness-of-fit of the theoretical 
distribution. The example plot below shows an observed 
survivorship function and the fitted Weibull distribution. 
Specifically, the three lines in this plot denote the 
theoretical distributions that resulted from three different 
estimation procedures (least squares and two methods of 
weighted least squares). 

Kaplan-Meier Product-Limit Estimator 

Rather than classifying the observed survival times into 
a life table, we can estimate the survival function directly 
from the continuous survival or failure times. Intuitively, 
imagine that we create a life table so that each time interval 
contains exactly one case. Multiplying out the survival 
probabilities across the “intervals” (i.e., for each single 
observation) we would get for the survival function: 

S(t) = Π_{j=1}^{t} [(n-j)/(n-j+1)]^δ(j) 

In this equation, S(t) is the estimated survival function, 
n is the total number of cases, and Π denotes the 
multiplication (geometric sum) across all cases less than or 
equal to t; δ(j) is a constant that is either 1 if the j’th case 
is uncensored (complete), or 0 if it is censored. This 
estimate of the survival function is also called the product-limit 
estimator, and was first proposed by Kaplan and Meier 
(1958). An example plot of this function is shown below. 



The advantage of the Kaplan-Meier Product-Limit 
method over the life table method for analyzing survival 
and failure time data is that the resulting estimates do not 
depend on the grouping of the data (into a certain number 
of time intervals). Actually, the Product-Limit method and 
the life table method are identical if the intervals of the life 
table contain at most one observation. 
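
The product-limit formula can be sketched directly in Python. This is an illustrative sketch with hypothetical names and data; at tied times it places uncensored cases before censored ones, which is a common convention and an assumption here.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  observed survival or censoring times
    events: 1 if the case is uncensored (complete), 0 if it is censored
    Returns a list of (time, estimated S(time)) pairs in time order.
    """
    n = len(times)
    # Sort by time; at tied times, uncensored cases come first (assumption).
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    surv = 1.0
    curve = []
    for j, i in enumerate(order, start=1):
        if events[i] == 1:
            # Factor (n - j) / (n - j + 1) enters only for uncensored cases.
            surv *= (n - j) / (n - j + 1)
        curve.append((times[i], surv))
    return curve


# Hypothetical data: five cases, the second and fifth are censored.
km_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for t, s in km_curve:
    print(t, round(s, 4))
```

Censored cases leave the estimate unchanged at their own time point but still reduce the number of cases counted as at risk afterwards, which is how the method uses the partial information they carry.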

COMPARING SAMPLES 

General Introduction: One can compare the survival 
or failure times in two or more samples. In principle, because 
survival times are not normally distributed, nonparametric 
tests that are based on the rank ordering of survival times 
should be applied. A wide range of nonparametric tests can 
be used in order to compare survival times; however, the 
standard tests cannot handle censored observations. 

Available Tests: The following five different (mostly 
nonparametric) tests for censored data are available: Gehan’s 
generalized Wilcoxon test, the Cox-Mantel test, Cox’s F 
test, the log-rank test, and Peto and Peto’s generalized 
Wilcoxon test. A nonparametric test for the comparison of 
multiple groups is also available. Most of these tests are 
accompanied by appropriate z-values (values of the standard 
normal distribution); these z-values can be used to test for 
the statistical significance of any differences between groups. 
However, note that most of these tests will only yield reliable 
results with fairly large sample sizes; the small sample 
“behavior” is less well understood. 

Choosing a Two-Sample Test: There are no widely 
accepted guidelines concerning which test to use in a 
particular situation. Cox’s F test tends to be more powerful 
than Gehan’s generalized Wilcoxon test when: 

• Sample sizes are small (i.e., n per group less than 
50); 

• Samples are from an exponential or Weibull distribution; 

• There are no censored observations (see Gehan & 
Thomas, 1969). 

Lee, Desu, and Gehan (1975) compared Gehan’s test to 
several alternatives and showed that the Cox-Mantel test 
and the log-rank test are more powerful (regardless of 
censoring) when the samples are drawn from a population 
that follows an exponential or Weibull distribution; under 
those conditions there is little difference between the 
Cox-Mantel test and the log-rank test. Lee (1980) discusses the 
power of different tests in greater detail. 
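
As one concrete example of a two-sample test for censored data, the following sketch computes a z-value for the log-rank test. It uses the standard textbook construction (observed minus expected failures in the first group at each distinct failure time, with a hypergeometric variance); the function name and the data are hypothetical, and no small-sample correction is applied.

```python
import math


def log_rank_z(times1, events1, times2, events2):
    """z-value of the two-sample log-rank test (events: 1 = failed, 0 = censored)."""
    data = [(t, e, 0) for t, e in zip(times1, events1)]
    data += [(t, e, 1) for t, e in zip(times2, events2)]
    failure_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in failure_times:
        n1 = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group 1
        n2 = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group 2
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d2 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n1 + n2, d1 + d2
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return observed_minus_expected / math.sqrt(variance)


# Hypothetical data: group 1 fails early, group 2 fails late, no censoring.
z_value = log_rank_z([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
print(z_value)
```

A positive z-value here means that the first group experienced more failures than expected under the hypothesis of identical survival in both groups.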

Multiple Sample Test: There is a multiple-sample test 
that is an extension (or generalization) of Gehan’s 
generalized Wilcoxon test, Peto and Peto’s generalized 
Wilcoxon test, and the log-rank test. First, a score is assigned 
to each survival time using Mantel’s procedure (Mantel, 
1967); next, a Chi-square value is computed based on the 
sums (for each group) of this score. If only two groups are 
specified, then this test is equivalent to Gehan’s generalized 
Wilcoxon test, and the computations will default to that 
test in this case. 


Unequal Proportions of Censored Data: When 
comparing two or more groups it is very important to 
examine the number of censored observations in each group. 
Particularly in medical research, censoring can be the result 
of, for example, the application of different treatments: 
patients who get better faster or get worse as the result of a 
treatment may be more likely to drop out of the study, 
resulting in different numbers of censored observations in 
each group. Such systematic censoring may greatly bias the 
results of comparisons. 


REGRESSION MODELS 
General Introduction 


A common research question in medical, biological, or 
engineering (failure time) research is to determine whether 
or not certain continuous (independent) variables are 
correlated with the survival or failure times. There are two 
major reasons why this research issue cannot be addressed 
via straightforward multiple regression techniques. First, 
the dependent variable of interest (survival/failure time) is 
most likely not normally distributed, which is a serious violation of 
an assumption for ordinary least squares multiple 
regression. Survival times usually follow an exponential or 
Weibull distribution. Second, there is the problem of 
censoring, that is, some observations will be incomplete. 

Cox’s Proportional Hazard Model 

The proportional hazard model is the most general of 
the regression models because it is not based on any 
assumptions concerning the nature or shape of the 
underlying survival distribution. The model assumes that 
the underlying hazard rate (rather than survival time) is a 
function of the independent variables (covariates); no 
assumptions are made about the nature or shape of the 
hazard function. Thus, in a sense, Cox’s regression model 
may be considered to be a nonparametric method. The model 
may be written as: 

h{(t), (z1, z2, ..., zm)} = h0(t)*exp(b1*z1 + ... + bm*zm) 

where h(t,...) denotes the resultant hazard, given the values 
of the m covariates for the respective case (z1, z2, ..., zm) and 
the respective survival time (t). The term h0(t) is called the 
baseline hazard; it is the hazard for the respective individual 
when all independent variable values are equal to zero. We 
can linearize this model by dividing both sides of the 
equation by h0(t) and then taking the natural logarithm of 
both sides: 

log[h{(t), (z...)}/h0(t)] = b1*z1 + ... + bm*zm 

We now have a fairly “simple” linear model that can be 
readily estimated. 

Assumptions: While no assumptions are made about 
the shape of the underlying hazard function, the model 
equations shown above do imply two assumptions. First, 
they specify a multiplicative relationship between the 
underlying hazard function and the log-linear function of 
the covariates. This assumption is also called the 
proportionality assumption. In practical terms, it is assumed 
that, given two observations with different values for the 
independent variables, the ratio of the hazard functions for 
those two observations does not depend on time. The second 
assumption, of course, is that there is a log-linear relationship 
between the independent variables and the underlying 
hazard function. 
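
The proportionality assumption can be checked numerically with a small sketch. The baseline hazard, the coefficients, and the covariate values below are hypothetical and chosen only for illustration; the point is that the baseline hazard cancels out of the ratio, whatever its shape.

```python
import math


def cox_hazard(t, z, b, baseline):
    """Hazard at time t for covariates z: baseline(t) * exp(b1*z1 + ... + bm*zm)."""
    return baseline(t) * math.exp(sum(bk * zk for bk, zk in zip(b, z)))


def baseline(t):
    # Hypothetical baseline hazard that increases with time.
    return 0.01 * t


b = [0.5, -0.2]        # hypothetical regression coefficients
case_a = [1.0, 3.0]    # covariate values for two hypothetical observations
case_b = [0.0, 1.0]

hazard_ratios = [
    cox_hazard(t, case_a, b, baseline) / cox_hazard(t, case_b, b, baseline)
    for t in (1.0, 5.0, 20.0)
]
print(hazard_ratios)
```

All three ratios equal exp(0.5*1 - 0.2*2) = exp(0.1), so the ratio of the two hazards does not depend on time.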


Cox’s Proportional Hazard Model with Time-Dependent Covariates 

An assumption of the proportional hazard model is that 
the hazard function for an individual (i.e., observation in 
the analysis) depends on the values of the covariates and 
the value of the baseline hazard. Given two individuals 
with particular values for the covariates, the ratio of the 
estimated hazards over time will be constant—hence the 
name of the method: the proportional hazard model. The 
validity of this assumption may often be questionable. For 
example, age is often included in studies of physical health. 
Suppose you studied survival after surgery. It is likely that 
age is a more important predictor of risk immediately after 
surgery than some time after the surgery (after initial 
recovery). In accelerated life testing one sometimes uses a 
stress covariate (e.g., amount of voltage) that is slowly 
increased over time until failure occurs (e.g., until the 
electrical insulation fails; see Lawless, 1982, page 393). In 
this case, the impact of the covariate is clearly dependent 
on time. The user can specify arithmetic expressions to define 
covariates as functions of several variables and survival 
time. 

Testing the Proportionality Assumption: As 
indicated by the previous examples, there are many 
applications where it is likely that the proportionality 
assumption does not hold. In that case, one can explicitly 
define covariates as functions of time. For example, the 
analysis of a data set presented by Pike (1966) consists of 
survival times for two groups of rats that had been exposed 
to a carcinogen (see also Lawless, 1982, page 393, for a 
similar example). Suppose that z is a grouping variable 
with codes 1 and 0 to denote whether or not the respective 
rat was exposed. One could then fit the proportional hazard 
model: 

h(t,z) = h0(t)*exp{b1*z + b2*[z*log(t)-5.4]} 

Thus, in this model the conditional hazard at time t is a 
function of (1) the baseline hazard h0, (2) the covariate z, 
and (3) of z times the logarithm of time. Note that the 
constant 5.4 is used here for scaling purposes only: the 
mean of the logarithm of the survival times in this data set 
is equal to 5.4. In other words, the conditional hazard at 
each point in time is a function of the covariate and time; 
thus, the effect of the covariate on survival is dependent on 
time; hence the name time-dependent covariate. This model 
allows one to specifically test the proportionality assumption. 
If parameter b2 is statistically significant (e.g., if it is at 
least twice as large as its standard error), then one can 
conclude that, indeed, the effect of the covariate z on survival 
is dependent on time, and, therefore, that the proportionality 
assumption does not hold. 
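
A short sketch shows why a nonzero coefficient for the time-dependent term breaks proportionality, and how the rule of thumb for its significance can be applied. The coefficient values and standard error are hypothetical, not estimates from the Pike (1966) data.

```python
import math


def exposure_hazard_ratio(t, b1, b2):
    """Ratio h(t, z=1) / h(t, z=0) for h0(t)*exp{b1*z + b2*[z*log(t) - 5.4]}.

    The baseline hazard h0(t) cancels, so it does not need to be specified.
    """
    linear_exposed = b1 * 1 + b2 * (1 * math.log(t) - 5.4)
    linear_unexposed = b1 * 0 + b2 * (0 * math.log(t) - 5.4)
    return math.exp(linear_exposed - linear_unexposed)


def coefficient_to_error_ratio(estimate, standard_error):
    """Estimate divided by its standard error (compare to about 2 in absolute value)."""
    return estimate / standard_error


# With b2 = 0 the ratio is constant over time (proportional hazards).
constant = [exposure_hazard_ratio(t, 0.3, 0.0) for t in (100.0, 300.0)]
# With b2 != 0 the ratio changes with time.
varying = [exposure_hazard_ratio(t, 0.3, 0.4) for t in (100.0, 300.0)]
ratio = coefficient_to_error_ratio(0.5, 0.2)  # hypothetical b2 and standard error
print(constant, varying, ratio)
```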

Exponential Regression 

Basically, this model assumes that the survival time 
distribution is exponential, and contingent on the values of 
a set of independent variables (zi). The rate parameter of 
the exponential distribution can then be expressed as: 

S(z) = exp(a + b1*z1 + b2*z2 + ... + bm*zm) 

S(z) denotes the survival times, a is a constant, and the 
bi's are the regression parameters. 

Goodness-of-fit: The Chi-square goodness-of-fit value 
is computed as a function of the log-likelihood for the model 
with all parameter estimates (L1), and the log-likelihood of 
the model in which all covariates are forced to 0 (zero; L0). 
If this Chi-square value is significant, we reject the null 
hypothesis and assume that the independent variables are 
significantly related to survival times. 
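
The text does not give the exact function of the two log-likelihoods; a common choice, assumed here, is the likelihood-ratio statistic 2*(L1 - L0), referred to a Chi-square distribution with as many degrees of freedom as there are covariates. The sketch below handles the single-covariate case (one degree of freedom) with hypothetical log-likelihood values.

```python
import math


def likelihood_ratio_chi_square(log_lik_full, log_lik_null):
    """Assumed form of the goodness-of-fit statistic: 2 * (L1 - L0)."""
    return 2.0 * (log_lik_full - log_lik_null)


def chi_square_p_value_df1(chi_square):
    """Upper-tail probability of a Chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))


# Hypothetical log-likelihoods for a model with one covariate.
chi_square = likelihood_ratio_chi_square(-100.0, -103.0)
p_value = chi_square_p_value_df1(chi_square)
print(chi_square, p_value)
```

For more than one covariate the same statistic applies, but the p-value must come from a Chi-square distribution with the corresponding degrees of freedom.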

Standard exponential order statistic: One way to 
check the exponentiality assumption of this model is to plot 
the residual survival times against the standard exponential 
order statistic theta. If the exponentiality assumption is 
met, then all points in this plot will be arranged roughly in 
a straight line. 


Normal and Lognormal Regression 

In this model, it is assumed that the survival times (or 
log survival times) come from a normal distribution; the 
resulting model is basically identical to the ordinary multiple 
regression model, and may be stated as: 

t = a + b1*z1 + b2*z2 + ... + bm*zm 

where t denotes the survival times. For lognormal regression, 
t is replaced by its natural logarithm. The normal regression 
model is particularly useful because many data sets can be 
transformed to yield approximations of the normal 
distribution. Thus, in a sense this is the most general fully 
parametric model (as opposed to Cox’s proportional hazard 
model which is non-parametric), and estimates can be 
obtained for a variety of different underlying survival 
distributions. 

Goodness-of-fit: The Chi-square value is computed as 
a function of the log-likelihood for the model with all 
independent variables (L1), and the log-likelihood of the 
model in which all independent variables are forced to 0 
(zero, L0). 

Stratified Analyses 

The purpose of a stratified analysis is to test the 
hypothesis whether identical regression models are 
appropriate for different groups, that is, whether the 
relationships between the independent variables and 
survival are identical in different groups. To perform a 
stratified analysis, one must first fit the respective regression 
model separately within each group. The sum of the log- 
likelihoods from these analyses represents the log-likelihood 
of the model with different regression coefficients (and 
intercepts where appropriate) in different groups. The next 
step is to fit the requested regression model to all data in 
the usual manner (i.e., ignoring group membership), and 
compute the log-likelihood for the overall fit. The difference 
between the log-likelihoods can then be tested for statistical 
significance. 




Quality Control Charts 


In all production processes, we need to monitor the extent 
to which our products meet specifications. In the most 
general terms, there are two “enemies” of product quality: 
(1) deviations from target specifications, and (2) excessive 
variability around target specifications. During the earlier 
stages of developing the production process, designed 
experiments are often used to optimize these two quality 
characteristics; the methods provided in Quality Control 
are on-line or in-process quality control procedures to monitor 
an on-going production process. 

General Approach 

The general approach to on-line quality control is 
straightforward: We simply extract samples of a certain 
size from the ongoing production process. We then produce 
line charts of the variability in those samples, and consider 
their closeness to target specifications. If a trend emerges 
in those lines, or if samples fall outside pre-specified limits, 
then we declare the process to be out of control and take 
action to find the cause of the problem. These types of charts 
are sometimes also referred to as Shewhart control charts. 

Interpreting the chart: The most standard display 
actually contains two charts (and two histograms); one is 
called an X-bar chart, the other is called an R chart. 

In both line charts, the horizontal axis represents the 
different samples; the vertical axis for the X-bar chart 
represents the means for the characteristic of interest; the 
vertical axis for the R chart represents the ranges. For 
example, suppose we wanted to control the diameter of piston 
rings that we are producing. The center line in the X-bar 
chart would represent the desired standard size (e.g., 
diameter in millimeters) of the rings, while the center line 
in the R chart would represent the acceptable (within- 
specification) range of the rings within samples; thus, this 
latter chart is a chart of the variability of the process (the 
larger the variability, the larger the range). In addition to 
the center line, a typical chart includes two additional 
horizontal lines to represent the upper and lower control 
limits (UCL, LCL, respectively); we will return to those 
lines shortly. Typically, the individual points in the chart, 
representing the samples, are connected by a line. If this 
line moves outside the upper or lower control limits or 
exhibits systematic patterns across consecutive samples, then 
a quality problem may potentially exist. 


Establishing Control Limits 

Even though one could arbitrarily determine when to 
declare a process out of control (that is, outside the UCL-LCL 
range), it is common practice to apply statistical 
principles to do so. Elementary Concepts discusses the 
concept of the sampling distribution, and the characteristics 
of the normal distribution. The method for constructing the 
upper and lower control limits is a straightforward 
application of the principles described there. 

Example. Suppose we want to control the mean of a 
variable, such as the size of piston rings. Under the 
assumption that the mean (and variance) of the process 
does not change, the successive sample means will be 
distributed normally around the actual mean. Moreover, 
without going into details regarding the derivation of this 
formula, we also know (because of the central limit theorem, 
and thus approximate normal distribution of the means; 
see, for example, Hoyer and Ellis, 1996) that the distribution 
of sample means will have a standard deviation of Sigma 
(the standard deviation of individual data points or 
measurements) over the square root of n (the sample size). 
It follows that approximately 95% of the sample means will 
fall within the limits ± 1.96 * Sigma/Square Root(n) around 
the process mean (refer 
to Elementary Concepts for a discussion of the characteristics 
of the normal distribution and the central limit theorem). 
In practice, it is common to replace the 1.96 with 3 (so that 
the interval will include approximately 99.7% of the sample 
means), and to define the upper and lower control limits as 
plus and minus 3 sigma limits, respectively. 
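
The control limit calculation for the X-bar chart can be sketched as follows. The function names, the target diameter, and the sample means are hypothetical; Sigma is treated as known, whereas in practice it is usually estimated from the process data.

```python
import math


def xbar_control_limits(center, sigma, n, k=3.0):
    """Lower and upper control limits: center -/+ k * sigma / sqrt(n)."""
    half_width = k * sigma / math.sqrt(n)
    return center - half_width, center + half_width


def out_of_control_samples(sample_means, lcl, ucl):
    """Indices of the sample means that fall outside the control limits."""
    return [i for i, m in enumerate(sample_means) if m < lcl or m > ucl]


# Hypothetical piston ring process: target 74 mm, sigma 0.01 mm, samples of 4.
lcl, ucl = xbar_control_limits(center=74.0, sigma=0.01, n=4)
flagged = out_of_control_samples([74.001, 74.02, 73.99, 73.98], lcl, ucl)
print(lcl, ucl, flagged)
```

With k set to 1.96 instead of 3 the same function gives the narrower limits that contain approximately 95% of the sample means.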

General case: The general principle for establishing 
control limits just described applies to all control charts. 
After deciding on the characteristic we want to control, for 
example, the standard deviation, we estimate the expected 
variability of the respective characteristic in samples of the 
size we are about to take. Those estimates are then used to 
establish the control limits on the chart. 

Common Types of Charts 

The types of charts are often classified according to the 
type of quality characteristic that they are supposed to 
monitor: there are quality control charts for variables and 
control charts for attributes. Specifically, the following charts 
are commonly constructed for controlling variables: 


X-bar chart: In this chart the sample means are plotted 
in order to control the mean value of a variable (e.g., size of 
piston rings, strength of materials, etc.). 


R chart: In this chart, the sample ranges are plotted in 
order to control the variability of a variable. 

S chart: In this chart, the sample standard deviations 
are plotted in order to control the variability of a variable. 

S**2 chart: In this chart, the sample variances are plotted 
in order to control the variability of a variable. 

For controlling quality characteristics that represent 
attributes of the product, the following charts are commonly 
constructed: 


C chart: In this chart (see example below), we plot the 
number of defectives (per batch, per day, per machine, per 
100 feet of pipe, etc.). This chart assumes that defects of 
the quality attribute are rare, and the control limits in this 
chart are computed based on the Poisson distribution (the 
distribution of rare events). 

U chart: In this chart we plot the rate of defectives, 
that is, the number of defectives divided by the number of 
units inspected (the n; e.g., feet of pipe, number of batches). 
Unlike the C chart, this chart does not require a constant 
number of units, and it can be used, for example, when the 
batches (samples) are of different sizes. 

Np chart: In this chart, we plot the number of defectives 
(per batch, per day, per machine) as in the C chart. However, 
the control limits in this chart are not based on the 
distribution of rare events but rather on the binomial 
distribution. Therefore, this chart should be used if the 
occurrence of defectives is not rare (e.g., they occur in more 
than 5% of the units inspected). For example, we may use 
this chart to control the number of units produced with 
minor flaws. 


P chart: In this chart, we plot the percent of defectives 
(per batch, per day, per machine, etc.) as in the U chart. 
However, the control limits in this chart are not based on 
the distribution of rare events but rather on the binomial 
distribution (of proportions). Therefore, this chart is most 
applicable to situations where the occurrence of defectives 
is not rare (e.g., we expect the percent of defectives to be 
more than 5% of the total number of units produced). 
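
The difference between limits based on the distribution of rare events and limits based on the binomial distribution can be sketched for the C chart and the P chart. The sketch assumes the usual 3 sigma construction, takes the distribution of rare events to be the Poisson distribution (whose variance equals its mean), and truncates limits that fall outside the possible range; the function names and numbers are hypothetical.

```python
import math


def c_chart_limits(c_bar):
    """3 sigma limits for counts, using the Poisson standard deviation sqrt(c_bar)."""
    half_width = 3.0 * math.sqrt(c_bar)
    return max(0.0, c_bar - half_width), c_bar + half_width


def p_chart_limits(p_bar, n):
    """3 sigma limits for proportions, using the binomial standard deviation."""
    half_width = 3.0 * math.sqrt(p_bar * (1.0 - p_bar) / n)
    return max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)


c_limits = c_chart_limits(4.0)        # average of 4 defectives per batch
p_limits = p_chart_limits(0.10, 100)  # 10% defectives, samples of 100 units
print(c_limits, p_limits)
```

For the C chart the lower limit is truncated at zero because a count cannot be negative.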

All of these charts can be adapted for short production 
runs (short run charts), and for multiple process streams 
(multiple stream group charts). 


Short Run Charts 

The short run control chart, or control chart for short 
production runs, plots observations of variables or attributes 
for multiple parts on the same chart. Short run control 
charts were developed to address the requirement that 
several dozen measurements of a process must be collected 
before control limits are calculated. Meeting this requirement 
is often difficult for operations that produce a limited number 
of a particular part during a production run. 

For example, a paper mill may produce only three or 
four (huge) rolls of a particular kind of paper (i.e., part) and 
then shift production to another kind of paper. But if 
variables, such as paper thickness, or attributes, such as 
blemishes, are monitored for several dozen rolls of paper of, 
say, a dozen different kinds, control limits for thickness 
and blemishes could be calculated for the transformed 
(within the short production run) variable values of interest. 
Specifically, these transformations will rescale the variable 
values of interest such that they are of compatible 
magnitudes across the different short production runs (or 
parts). The control limits computed for those transformed 
values could then be applied in monitoring thickness, and 
blemishes, regardless of the types of paper (parts) being 
produced. Statistical process control procedures could be 
used to determine if the production process is in control, to 
monitor continuing production, and to establish procedures 
for continuous quality improvement. 


Short Run Charts for Variables 

Nominal chart, target chart: There are several 
different types of short run charts. The most basic are the 
nominal short run chart, and the target short run chart. In 
these charts, the measurements for each part are 
transformed by subtracting a part-specific constant. These 
constants can either be the nominal values for the respective 
parts (nominal short run chart), or they can be target values 
computed from the (historical) means for each part (Target 
X-bar and R chart). For example, the diameters of piston 
bores for different engine blocks produced in a factory can 
only be meaningfully compared (for determining the 
consistency of bore sizes) if the mean differences between 
bore diameters for different sized engines are first removed. 
The nominal or target short run chart makes such 
comparisons possible. Note that for the nominal or target 
chart it is assumed that the variability across parts is 
identical, so that control limits based on a common estimate 
of the process sigma are applicable. 

Standardized short run chart: If the variability of 
the process for different parts cannot be assumed to be 
identical, then a further transformation is necessary before 
the sample means for different parts can be plotted in the 
same chart. Specifically, in the standardized short run chart 
the plot points are further transformed by dividing the 
deviations of sample means from part means (or nominal or 
target values for parts) by part-specific constants that are 
proportional to the variability for the respective parts. For 
example, for the short run X-bar and R chart, the plot points 
(that are shown in the X-bar chart) are computed by first 
subtracting from each sample mean a part-specific constant 
(e.g., the respective part mean, or nominal value for the 
respective part), and then dividing the difference by another 
constant, for example, by the average range for the respective 
chart. These transformations will result in comparable scales 
for the sample means for different parts. 
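
The two transformations for variables can be sketched as follows. The part names, target values, and average ranges are hypothetical, and the divisor is taken to be the average range for each part, which is one of the options mentioned above.

```python
def nominal_points(samples, targets):
    """Nominal/target chart: sample mean minus the part-specific constant."""
    return [mean - targets[part] for part, mean in samples]


def standardized_points(samples, targets, average_ranges):
    """Standardized chart: deviation divided by a part-specific variability constant."""
    return [(mean - targets[part]) / average_ranges[part] for part, mean in samples]


# Hypothetical samples: (part, sample mean) in production order.
samples = [("A", 10.2), ("B", 50.5), ("A", 9.9)]
targets = {"A": 10.0, "B": 50.0}
average_ranges = {"A": 0.4, "B": 1.0}

deviations = nominal_points(samples, targets)
plot_points = standardized_points(samples, targets, average_ranges)
print(deviations, plot_points)
```

After standardization, the first sample of part A and the sample of part B plot at the same height, although their raw deviations from target (0.2 and 0.5) differ.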

Short Run Charts for Attributes 

For attribute control charts (C, U, Np, or P charts), the 
estimate of the variability of the process (proportion, rate, 
etc.) is a function of the process average (average proportion, 
rate, etc.; for example, the standard deviation of a proportion 
p is equal to the square root of p*(1-p)/n). Hence, only 
standardized short run charts are available for attributes. 
For example, in the short run P chart, the plot points are 
computed by first subtracting from the respective sample p 
values the average part p’s, and then dividing by the 
standard deviation of the average p’s. 
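
For the short run P chart, the standardization can be sketched in one function. It assumes that the standard deviation is the binomial one, the square root of p*(1-p)/n, evaluated at the average proportion for the part; the function name and the numbers are hypothetical.

```python
import math


def short_run_p_point(sample_p, part_average_p, n):
    """Standardized plot point for a short run P chart."""
    std_dev = math.sqrt(part_average_p * (1.0 - part_average_p) / n)
    return (sample_p - part_average_p) / std_dev


# Hypothetical sample: 16% defectives in 100 units, part average of 10%.
p_point = short_run_p_point(0.16, 0.10, 100)
print(p_point)
```

The resulting plot point is about 2, that is, the sample proportion lies about two standard deviations above the part average.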

Multiple Stream Group Charts 

The group control chart plots multiple streams of 
observations or attributes on the same chart. This simplifies 
interpretation when monitoring many process streams or 
characteristics. Process streams may be different machines, 
assembly lines, operators, or the like. All of these may be 
plotted on a single group chart. 

For a group X-bar chart, two points are plotted for each
of the samples for which measurements are collected,
producing two plotted lines across samples. The upper line
is a plot of the highest mean values from the multiple
streams or attributes measured for each of the samples, 
and the lower line is a plot of the lowest mean values from 
the multiple streams or attributes for each of the samples. 
These upper and lower plotted points represent the 
maximum and minimum mean values across the multiple 
streams or attributes for each sample, and if these extreme 
values are within the specified control limits, then obviously
all other mean values are also within the control limits. 
The group X-bar chart, therefore, allows one to quickly 
determine whether many process streams or characteristics 
are under control without necessarily inspecting each and 
every measurement. 

For group R-bar, S, or S**2 charts for variables, or for 
group C, U, Np, or P charts for attributes, the two points 
that are plotted for each sample are the respective maximum 
and minimum ranges, standard deviations, etc., from the 
multiple streams or attributes measured for each sample. 
As with the group X-bar chart, comparison of these extreme 
values with the specified control limits allows one to quickly 
determine whether the multiple process streams or 
characteristics are under control. 
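
The logic of the group chart can be illustrated with a short Python sketch. The data layout (one list of per-stream means for each sample) and the function names are assumptions made for this example only.

```python
def group_chart_points(stream_means):
    """Return the upper and lower plotted lines of a group X-bar chart.

    stream_means: list of samples; each sample is a list holding the
    mean of every stream (machine, operator, ...) for that sample.
    """
    upper = [max(sample) for sample in stream_means]   # highest stream mean per sample
    lower = [min(sample) for sample in stream_means]   # lowest stream mean per sample
    return upper, lower


def all_streams_in_control(stream_means, lcl, ucl):
    """True if both extremes of every sample lie within the control limits."""
    upper, lower = group_chart_points(stream_means)
    return all(lcl <= lo and hi <= ucl for hi, lo in zip(upper, lower))


# Hypothetical means for three streams over four samples.
means = [[10.1, 9.8, 10.0], [10.3, 9.9, 10.2], [9.7, 10.0, 10.4], [10.0, 10.1, 9.9]]
print(group_chart_points(means))
print(all_streams_in_control(means, lcl=9.5, ucl=10.5))
```

Because only the maximum and minimum are compared with the limits, the check covers all streams without inspecting each one individually.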


A group chart for a single part is referred to as a
standard group chart, or simply a group chart, as it is
commonly called. Group charts for multiple parts are referred
to as group short run charts. The same procedure is used
for producing group short run charts as for producing
standard group charts; the only difference for group short
run charts is that the plot points are determined after all

Unequal Sample Sizes

When the samples plotted in the control chart are not of 
equal size, then the control limits around the center line
(target specification) cannot be represented by a straight
line. For example, to return to the formula Sigma/Square
Root(n) presented earlier for computing control limits for
the X-bar chart, it is obvious that unequal n’s will lead to
different control limits for different sample sizes. There are
three ways of dealing with this situation. 

Average sample size: If one wants to maintain the 
straight-line control limits (e.g., to make the chart easier to 
read and easier to use in presentations), then one can 
compute the average n per sample across all samples, and 
establish the control limits based on the average sample 
size. This procedure is not “exact”; however, as long as the
sample sizes are reasonably similar to each other, this
procedure is quite adequate.

Variable control limits: Alternatively, one may 
compute different control limits for each sample, based on 
the respective sample sizes. This procedure will lead to 
variable control limits, and result in step-chart like control 
lines in the plot. This procedure ensures that the correct 
control limits are computed for each sample. However, one 
loses the simplicity of straight-line control limits. 

Stabilized (normalized) chart: The best of two worlds 
(straight line control limits that are accurate) can be 
accomplished by standardizing the quantity to be controlled 
(mean, proportion, etc.) according to units of sigma. The 
control limits can then be expressed in straight lines, while 
the location of the sample points in the plot depends not only
on the characteristic to be controlled, but also on the 
respective sample n’s. The disadvantage of this procedure is 
that the values on the vertical (Y) axis in the control chart 
are in terms of sigma rather than the original units of 
measurement, and therefore, those numbers cannot be taken 
at face value (e.g., a sample with a value of 3 is 3 times
sigma away from specifications; in order to express the value 
of this sample in terms of the original units of measurement 
we need to perform some computations to convert this 
number back). 
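
The three approaches can be compared in a small Python sketch for the X-bar chart. It assumes a known process sigma and 3-sigma limits; the function names and the example numbers are hypothetical.

```python
import math


def average_n_limits(center, sigma, sizes, k=3.0):
    """Straight-line limits based on the average sample size (approximate)."""
    n_bar = sum(sizes) / len(sizes)
    half_width = k * sigma / math.sqrt(n_bar)
    return center - half_width, center + half_width


def variable_limits(center, sigma, sizes, k=3.0):
    """One (lower, upper) pair per sample, giving step-like control lines."""
    return [(center - k * sigma / math.sqrt(n), center + k * sigma / math.sqrt(n))
            for n in sizes]


def stabilized_points(means, center, sigma, sizes):
    """Sample means expressed in units of their own standard error."""
    return [(m - center) / (sigma / math.sqrt(n)) for m, n in zip(means, sizes)]


sizes = [4, 16, 9]
means = [11.0, 10.4, 9.2]
print(average_n_limits(10.0, 2.0, sizes))
print(variable_limits(10.0, 2.0, sizes))
print(stabilized_points(means, 10.0, 2.0, sizes))   # compare against fixed limits of -3 and +3
```

With the stabilized chart the limits are constant at plus and minus k, but the plotted values are no longer in the original units of measurement.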

Control Charts for Variables vs. Charts for Attributes 

Sometimes, the quality control engineer has a choice 
between variable control charts and attribute control charts. 

Advantages of attribute control charts: Attribute 
control charts have the advantage of allowing for quick 
summaries of various aspects of the quality of a product, 
that is, the engineer may simply classify products as 
acceptable or unacceptable, based on various quality criteria. 
Thus, attribute charts sometimes bypass the need for 
expensive, precise devices and time-consuming measurement 
procedures. Also, this type of chart tends to be more easily 
understood by managers unfamiliar with quality control 
procedures; therefore, it may provide more persuasive (to 
management) evidence of quality problems. 

Advantages of variable control charts: Variable control
charts are more sensitive than attribute control charts (see
Montgomery, 1985, p. 203). Therefore, variable control charts
may alert us to quality problems before any actual
“unacceptables” (as detected by the attribute chart) will
occur. Montgomery (1985) calls the variable control charts
leading indicators of trouble that will sound an alarm before
the number of rejects (scrap) increases in the production
process.

Control Chart for Individual Observations 

Variable control charts can be constructed for individual
observations taken from the production line, rather than
samples of observations. This is sometimes necessary when
testing samples of multiple observations would be too
expensive, inconvenient, or impossible. For example, the
number of customer complaints or product returns may only
be available on a monthly basis; yet, one would like to chart
those numbers to detect quality problems. Another common 
application of these charts occurs in cases when automated 
testing devices inspect every single unit that is produced. 

In that case, one is often primarily interested in detecting 
small shifts in the product quality (for example, gradual 
deterioration of quality due to machine wear). 

Out-of-Control Process: Runs Tests 

As mentioned earlier in the introduction, when a sample 
point (e.g., mean in an X-bar chart) falls outside the control
lines, one has reason to believe that the process may no 
longer be in control. In addition, one should look for 
systematic patterns of points (e.g., means) across samples, 
because such patterns may indicate that the process average 
has shifted. These tests are also sometimes referred to as 
AT&T runs rules (see AT&T, 1959) or tests for special causes. 
The terms special or assignable causes, as opposed to chance
or common causes, were used by Shewhart to distinguish
a process that is in control, with variation due to
random (chance) causes only, from a process that is out of
control, with variation that is due to some non-chance or
special (assignable) factors.

Like the sigma control limits discussed earlier, the runs
rules are based on “statistical” reasoning. For example, the 
probability of any sample mean in an X-bar control chart 
falling above the center line is equal to 0.5, provided 
(1) that the process is in control (i.e., that the center line 
value is equal to the population mean), (2) that consecutive 
sample means are independent (i.e., not auto-correlated), 
and (3) that the distribution of means follows the normal 
distribution. Simply stated, under those conditions there is 
a 50-50 chance that a mean will fall above or below the 
centerline. Thus, the probability that two consecutive means 
will fall above the centerline is equal to 0.5 times 0.5 
= 0.25. 



Accordingly, the probability that 9 consecutive samples
(or a run of 9 samples) will all fall on a given side of the
center line (e.g., above it) is equal to 0.5**9 = .00195. Note that this is
approximately the probability with which a sample mean
can be expected to fall outside the 3-times sigma limits
(given the normal distribution, and a process in control).
Therefore, one could look for 9 consecutive sample means 
on the same side of the centerline as another indication of 
an out-of-control condition. Refer to Duncan (1974) for details 
concerning the “statistical” interpretation of the other 
(more complex) tests. 
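
The arithmetic behind this comparison can be checked with a few lines of Python. The sketch assumes independent, normally distributed sample means and an in-control process, as stated above; the variable names are illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


p_run_given_side = 0.5 ** 9                       # 9 in a row above (or 9 in a row below)
p_run_either_side = 2 * 0.5 ** 9                  # 9 in a row on the same side, either one
p_outside_3sigma = 2 * (1.0 - normal_cdf(3.0))    # one mean beyond the 3-sigma limits

print(p_run_given_side, p_run_either_side, p_outside_3sigma)
```

All three probabilities are of the same small order of magnitude, which is why a run of 9 is treated as a signal comparable to a point outside the control limits.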


Zone A, B, C: Customarily, to define the runs tests, the 
area above and below the chart center line is divided into 
three “zones.” 



By default, Zone A is defined as the area between 2 and 
3 times sigma above and below the center line; Zone B is 
defined as the area between 1 and 2 times sigma, and Zone
C is defined as the area between the center line and 1 times 
sigma. 


9 points in Zone C or beyond (on one side of 
central line): If this test is positive (i.e., if this pattern is 
detected), then the process average has probably changed.
Note that it is assumed that the distribution of the respective 
quality characteristic in the plot is symmetrical around the 
mean. This is, for example, not the case for R charts, S 
charts, or most attribute charts. However, this is still a 
useful test to alert the quality control engineer to potential 
shifts in the process. For example, successive samples with 
less-than-average variability may be worth investigating, 
since they may provide hints on how to decrease the 
variation in the process. 

6 points in a row steadily increasing or decreasing: 
This test signals a drift in the process average. Often, such 
drift can be the result of tool wear, deteriorating 
maintenance, improvement in skill, etc. (Nelson, 1985). 

14 points in a row alternating up and down: If this 
test is positive, it indicates that two systematically 
alternating causes are producing different results. For 
example, one may be using two alternating suppliers, or 
monitor the quality for two different (alternating) shifts. 

2 out of 3 points in a row in Zone A or beyond: 
This test provides an “early warning” of a process shift. 
Note that the probability of a false positive (test is positive 
but process is in control) for this test in X-bar charts is 
approximately 2%. 

4 out of 5 points in a row in Zone B or beyond:
Like the previous test, this test may be considered to be an
“early warning indicator” of a potential process shift. The
false-positive error rate for this test is also about 2%.

15 points in a row in Zone C (above and below the 
center line): This test indicates a smaller variability than 
is expected (based on the current control limits). 

8 points in a row in Zone B, A, or beyond, on 
either side of the center line (without points in Zone 
C): This test indicates that different samples are affected 
by different factors, resulting in a bimodal distribution of
means. This may happen, for example, if different samples
in an X-bar chart were produced by one of two different
machines, where one produces above average parts, and the
other below average parts.
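
Three of the runs tests listed above can be sketched in Python as follows. This is an illustrative implementation, not the code of any particular quality control package; the function names are hypothetical, and each function returns the index of the point that completes the pattern (or None if the pattern is absent).

```python
def nine_on_one_side(points, center, run=9):
    """9 points in a row on one side of the center line."""
    count, side = 0, 0
    for i, x in enumerate(points):
        s = (x > center) - (x < center)        # +1 above, -1 below, 0 on the line
        if s != 0 and s == side:
            count += 1
        else:
            side, count = s, (1 if s != 0 else 0)
        if count >= run:
            return i
    return None


def six_trending(points, run=6):
    """6 points in a row steadily increasing or decreasing."""
    up = down = 1
    for i in range(1, len(points)):
        up = up + 1 if points[i] > points[i - 1] else 1
        down = down + 1 if points[i] < points[i - 1] else 1
        if up >= run or down >= run:
            return i
    return None


def two_of_three_beyond_2sigma(points, center, sigma):
    """2 out of 3 points in a row in Zone A or beyond, on the same side."""
    for i in range(2, len(points)):
        window = points[i - 2:i + 1]
        above = sum(x > center + 2 * sigma for x in window)
        below = sum(x < center - 2 * sigma for x in window)
        if above >= 2 or below >= 2:
            return i
    return None


# Hypothetical sample means with a slow upward drift.
means = [10.0, 10.1, 10.2, 10.3, 10.4, 10.5, 10.6]
print(six_trending(means))
```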

Operating Characteristic (OC) Curves 

A common supplementary plot to standard quality 
control charts is the so-called operating characteristic or OC 
curve (see example below). One question that comes to mind 
when using standard variable or attribute charts is how 
sensitive is the current quality control procedure? Put in 
more specific terms, how likely is it that you will not find a 
sample (e.g., mean in an X-bar chart) outside the control 
limits (i.e., accept the production process as “in control”),
when, in fact, it has shifted by a certain amount? This
probability is usually referred to as the β (beta) error
probability, that is, the probability of erroneously accepting
a process (mean, mean proportion, mean rate defectives,
etc.) as being “in control.” Note that operating characteristic
curves pertain to the false-acceptance probability using the
sample-outside-of-control-limits criterion only, and not the
runs tests described earlier. 

Operating characteristic curves are extremely useful for
exploring the power of our quality control procedure. The
actual decision concerning sample sizes should depend not
only on the cost of implementing the plan (e.g., cost per
item sampled), but also on the costs resulting from not
detecting quality problems. The OC curve allows the
engineer to estimate the probabilities of not detecting shifts
of certain sizes in the production quality. 
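
For an X-bar chart, points on the OC curve can be computed directly. The sketch below assumes a known process sigma, normally distributed means, 3-sigma limits, and a shift expressed in multiples of sigma; the function names are illustrative.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def beta_xbar(shift_in_sigma, n, limit=3.0):
    """Probability that one sample mean stays inside the control limits
    after the process mean has shifted by shift_in_sigma * sigma."""
    delta = shift_in_sigma * math.sqrt(n)      # shift measured in standard errors
    return normal_cdf(limit - delta) - normal_cdf(-limit - delta)


# Tabulate the beta error for several shift sizes and two sample sizes.
for k in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(k, round(beta_xbar(k, n=5), 4), round(beta_xbar(k, n=10), 4))
```

Plotting beta against the shift size for each candidate n gives the OC curves; larger samples drive beta down faster, which is the trade-off against sampling cost discussed above.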

Process Capability Indices 

For variable control charts, it is often desired to include 
so-called process capability indices in the summary graph. 
In short, process capability indices express (as a ratio) the 
proportion of parts or items produced by the current process 
that fall within user-specified limits (e.g., engineering 
tolerances). 

For example, the so-called Cp index is computed as: 

Cp = (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, 
and USL and LSL are the upper and lower specification 
(engineering) limits, respectively. If the distribution of the 
respective quality characteristic or variable (e.g., size of 
piston rings) is normal, and the process is perfectly centered 
(i.e., the mean is equal to the design center), then this 
index can be interpreted as the proportion of the range of 
the standard normal curve (the process width) that falls 
within the engineering specification limits. If the process is 
not centered, an adjusted index Cpk is used instead.

For a “capable” process, the Cp index should be greater
than 1, that is, the width between the specification limits
would be larger than 6 times sigma, so that over 99% of all
items or parts produced could be expected to fall inside the
acceptable engineering specifications.
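
Both indices are simple to compute. The sketch below uses the Cp formula given above and the commonly used definition of the adjusted index, Cpk = min(USL - mean, mean - LSL)/(3*sigma), which is not spelled out in the text; the function names and numbers are illustrative.

```python
def cp(usl, lsl, sigma):
    """Potential capability: specification width relative to 6 sigma."""
    return (usl - lsl) / (6.0 * sigma)


def cpk(usl, lsl, mean, sigma):
    """Capability adjusted for a process mean that is off center."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)


# Hypothetical specification limits of 7 and 13 with sigma = 1.
print(cp(13.0, 7.0, 1.0))            # centered or not, Cp is the same
print(cpk(13.0, 7.0, 10.0, 1.0))     # centered process: Cpk equals Cp
print(cpk(13.0, 7.0, 11.0, 1.0))     # mean shifted toward USL: Cpk drops
```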

Other Specialized Control Charts


The types of control charts mentioned so far are the 
“workhorses” of quality control, and they are probably the 
most widely used methods. However, with the advent of 
inexpensive desktop computing, procedures requiring more 
computational effort have become increasingly popular. 


X-bar Charts For Non-Normal Data: The control 
limits for standard X-bar charts are constructed based on 
the assumption that the sample means are approximately 
normally distributed. Thus, the underlying individual 
observations do not have to be normally distributed, since, 
as the sample size increases, the distribution of the means 
will become approximately normal (see the discussion of
the central limit theorem in Elementary Concepts;
however, note that for R, S, and S**2 charts, it is assumed
that the individual observations are normally distributed).
Shewhart (1931) in his original work experimented with 
various non-normal distributions for individual observations, 
and evaluated the resulting distributions of means for 
samples of size four. He concluded that, indeed, the standard 
normal distribution-based control limits for the means are 
appropriate, as long as the underlying distribution of
observations is approximately normal.


However, as Ryan (1989) points out, when the 
distribution of observations is highly skewed and the sample 
sizes are small, then the resulting standard control limits 
may produce a large number of false alarms (increased alpha 
error rate), as well as a larger number of false negative 
(“process-is-in-control”) readings (increased beta-error rate). 
You can compute control limits (as well as process capability 
indices) for X-bar charts based on so-called Johnson curves 
(Johnson, 1949), which allow approximating the skewness 
and kurtosis for a large range of non-normal distributions. 
These non-normal X-bar charts are useful when the
distribution of means across the samples is clearly skewed, 
or otherwise non-normal. 

Cumulative Sum (CUSUM) Chart: The CUSUM chart
was first introduced by Page (1954); the mathematical
principles involved in its construction are discussed in Ewan
(1963), Johnson (1961), and Johnson and Leone (1962).
If one plots the cumulative sum of deviations of
successive sample means from a target specification, even 
minor, permanent shifts in the process mean will eventually 
lead to a sizable cumulative sum of deviations. Thus, this 
chart is particularly well suited for detecting such small 
permanent shifts that may go undetected when using the 
X-bar chart. For example, if, due to machine wear, a process 
slowly “slides” out of control to produce results above target 
specifications, this plot would show a steadily increasing 
(or decreasing) cumulative sum of deviations from 
specification. 

To establish control limits in such plots, Barnard (1959)
proposed the so-called V-mask, which is plotted after the
last sample (on the right). The V-mask can be thought of as
the upper and lower control limits for the cumulative sums.
However, rather than being parallel to the centerline, these
lines converge at a particular angle to the right, producing 
the appearance of a V rotated on its side. If the line 
representing the cumulative sum crosses either one of the 
two lines, the process is out of control. 
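
The plotted quantity itself is easy to compute. The following Python sketch shows only the cumulative sum of deviations from target, not the V-mask; the function name and data are hypothetical.

```python
from itertools import accumulate


def cusum(sample_means, target):
    """Cumulative sum of deviations of successive sample means from target."""
    return list(accumulate(m - target for m in sample_means))


# Hypothetical process that runs slightly above a target of 10 from sample 4 on.
means = [10.0, 9.9, 10.1, 10.2, 10.2, 10.3, 10.2, 10.3]
print(cusum(means, target=10.0))
```

Each individual deviation here is small, but because they share the same sign they accumulate into a steadily growing sum, which is the pattern the CUSUM chart is designed to expose.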

Moving Average (MA) Chart: To return to the piston 
ring example, suppose we are mostly interested in detecting 
small trends across successive sample means. For example, 
we may be particularly concerned about machine wear, 
leading to a slow but constant deterioration of quality (i.e.,
deviation from specification). The CUSUM chart described 
above is one way to monitor such trends, and to detect 
small permanent shifts in the process average. Another way 
is to use some weighting scheme that summarizes the means 
of several successive samples; moving such a weighted mean 
across the samples will produce a moving average chart (as 
shown in the following graph). 
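
A minimal sketch of the plot points for a moving average chart is given below. It assumes equal weights within the window (the simplest weighting scheme) and a shorter window for the first few samples; the function name and the span are illustrative choices.

```python
def moving_average(sample_means, span=3):
    """Equal-weight moving average of the last `span` sample means."""
    points = []
    for i in range(len(sample_means)):
        window = sample_means[max(0, i - span + 1):i + 1]   # shorter at the start
        points.append(sum(window) / len(window))
    return points


print(moving_average([10.0, 10.4, 9.8, 10.6, 10.8, 11.0], span=3))
```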




Exponentially-weighted Moving Average (EWMA) 
Chart: The idea of moving averages of successive (adjacent) 
samples can be generalized. In principle, in order to detect 
a trend we need to weight successive samples to form a 
moving average; however, instead of a simple arithmetic 
moving average, we could compute a geometric moving 
average (this chart (see graph below) is also called Geometric 
Moving Average chart, see Montgomery, 1985, 1991). 

Specifically, we could compute each data point for the
plot as:

z_t = lambda*x-bar_t + (1-lambda)*z_(t-1)

In this formula, each point z_t is computed as lambda
times the respective mean x-bar_t plus one minus lambda times
the previous (computed) point in the plot. The parameter
lambda here should assume values greater than 0 and
less than 1. Without going into detail (see Montgomery, 
1985, p. 239), this method of averaging specifies that the 
weight of historically “old” sample means decreases 
geometrically as one continues to draw samples. The 
interpretation of this chart is much like that of the moving 
average chart, and it allows us to detect small shifts in the 
means, and, therefore, in the quality of the production 
process. 
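
The recursion, in which each plotted point is lambda times the current sample mean plus (1 - lambda) times the previous plotted point, can be sketched in Python as follows. The starting value (here called start, typically the target or the overall mean) is an assumption of this example, as the text does not specify it.

```python
def ewma(sample_means, lam, start):
    """Exponentially weighted moving average of successive sample means.

    lam: weight of the newest sample mean, strictly between 0 and 1.
    start: value used as the "previous point" for the first sample.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lambda must be greater than 0 and less than 1")
    z = start
    points = []
    for x_bar in sample_means:
        z = lam * x_bar + (1.0 - lam) * z      # older means decay geometrically
        points.append(z)
    return points


print(ewma([10.0, 10.2, 10.1, 10.6, 10.7], lam=0.2, start=10.0))
```

Smaller values of lambda give more weight to history and therefore a smoother line that reacts more slowly to individual samples.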

Regression Control Charts: Sometimes we want to 
monitor the relationship between two aspects of our 
production process. For example, a post office may want to 
monitor the number of worker-hours that are spent to 
process a certain amount of mail. These two variables should 
roughly be linearly correlated with each other, and the 
relationship can probably be described in terms of the
well-known Pearson product-moment correlation coefficient r.
This statistic is also described in Basic Statistics. The
regression control chart contains a regression line that
summarizes the linear relationship between the two
variables of interest. The individual data points are also
shown in the same graph. Around the regression line we
establish a confidence interval within which we would expect
a certain proportion (e.g., 95%) of samples to fall. Outliers 
in this plot may indicate samples where, for some reason, 
the common relationship between the two variables of 
interest does not hold. 

Applications: There are many useful applications for
the regression control chart. For example, professional 
auditors may use this chart to identify retail outlets with a 
greater than expected number of cash transactions given 
the overall volume of sales, or grocery stores with a greater 
than expected number of coupons redeemed, given the total 
sales. In both instances, outliers in the regression control 
charts (e.g., too many cash transactions; too many coupons 
redeemed) may deserve closer scrutiny. 
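
The idea of the regression control chart can be sketched in Python. As a simplifying assumption, the band around the regression line is taken here as a constant plus or minus k residual standard deviations rather than an exact confidence interval; the function names and the data are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_outliers(x, y, k=2.0):
    """Indices of points farther than k residual std. deviations from the line."""
    slope, intercept = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in residuals) / (len(x) - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * s]


# Hypothetical outlets: sales volume vs. number of cash transactions.
sales = list(range(10))
cash = [2 * v + 1 for v in sales]
cash[5] += 10                      # one outlet with unusually many cash transactions
print(flag_outliers(sales, cash))
```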

Pareto Chart Analysis: Quality problems are rarely
spread evenly across the different aspects of the production 
process or different plants. Rather, a few “bad apples” often 
account for the majority of problems. This principle has 
come to be known as the Pareto principle, which basically 
states that quality losses are mal-distributed in such a way 
that a small percentage of possible causes are responsible 
for the majority of the quality problems. For example, a 
relatively small number of “dirty” cars are probably 
responsible for the majority of air pollution; the majority of 
losses in most companies result from the failure of only one 
or two products. To illustrate the “bad apples”, one plots 
the Pareto chart, which simply amounts to a histogram 
showing the distribution of the quality loss (e.g., dollar loss)
across some meaningful categories; usually, the categories 
are sorted into descending order of importance (frequency, 
dollar amounts, etc.). Very often, this chart provides useful 
guidance as to where to direct quality improvement efforts. 
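
The data behind a Pareto chart amount to a sorted table. The following Python sketch sorts hypothetical loss categories in descending order and adds a cumulative percentage column (a common addition to the chart); the category names and amounts are invented for illustration.

```python
def pareto_table(losses):
    """Return (category, loss, cumulative percent) rows, largest loss first.

    losses: dict mapping a category name to its quality loss
    (frequency, dollar amount, etc.).
    """
    total = sum(losses.values())
    rows, running = [], 0.0
    for category, loss in sorted(losses.items(), key=lambda kv: kv[1], reverse=True):
        running += loss
        rows.append((category, loss, 100.0 * running / total))
    return rows


losses = {"scratches": 1200, "misalignment": 7400, "wrong color": 300, "cracks": 1100}
for row in pareto_table(losses):
    print(row)
```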

Representative Visualization Techniques


Categorized Graphs 

One of the most important, general, and also powerful analytic
methods involves dividing (“splitting”) the data set into
categories in order to compare the patterns of data between the
resulting subsets. This common technique is known under a
variety of terms (such as breaking down, grouping,
categorizing, splitting, slicing, drilling-down, or conditioning)
and it is used both in exploratory data analyses and
hypothesis testing. For example: A positive relation between
the age and the risk of a heart attack may be different in
males and females (it may be stronger in males). A promising
relation between taking a drug and a decrease of the
cholesterol level may be present only in women with a low
blood pressure and only in their thirties and forties. The
process capability indices or capability histograms can be
different for periods of time supervised by different operators.
The regression slopes can be different in different
experimental groups.

There are many computational techniques that capitalize 
on grouping and that are designed to quantify the differences 
that the grouping will reveal (e.g., ANOVA/MANOVA).
However, graphical techniques (such as categorized graphs 
discussed in this section) offer unique advantages that 
cannot be substituted by any computational method alone: 
they can reveal patterns that cannot be easily quantified 
(e.g., complex interactions, exceptions, anomalies) and they 
provide unique, multidimensional, global analytic 
perspectives to explore or “mine” the data. 

What are Categorized Graphs 

Categorized graphs (the term first used in STATISTICA 
software by StatSoft in 1990; also recently called Trellis 
graphs, by Becker, Cleveland, and Clark, at Bell Labs) 
produce a series of 2D, 3D, ternary, or nD graphs, one for each
selected category of cases (i.e., subset of cases), for example, 
respondents from New York, Chicago, Dallas, etc. These 
“component” graphs are placed sequentially in one display, 
allowing for comparisons between the patterns of data shown 
in graphs for each of the requested groups (e.g., cities). 



A variety of methods can be used to select the subsets; 
the simplest of them is using a categorical variable (e.g., a 
variable City, with three values New York, Chicago, and
Dallas). For example, the following graph shows histograms
of a variable representing self-reported stress levels in each
of the three cities.


One could conclude that the data suggest that people
living in Dallas are less likely to report being stressed, while
the patterns (distributions) of stress reporting in New York 
and Chicago are quite similar. 


Categorized graphs in some software systems (e.g., in 
STATISTICA) also support two-way or multi-way 
categorizations, where not one criterion (e.g., City ) but two 
or more criteria (e.g., City and Time of the day) are used to 
create the subsets. Two-way categorized graphs can be 
thought of as “crosstabulations of graphs” where each 
component graph represents a cross-section of one level of 
one grouping variable (e.g., City) and one level of the other 
grouping variable (e.g., Time).


Adding this second factor reveals that the patterns of 
stress reporting in New York and Chicago are actually quite 
different when the Time of questioning is taken into
consideration, whereas the Time factor makes little 
difference in Dallas. 
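
The data preparation behind a one-way or two-way categorized histogram can be sketched in Python. The sketch only computes the bin counts for each component graph; the record layout, variable names (City, Time, Stress), and bin edges are assumptions made for this illustration.

```python
from collections import defaultdict


def categorized_histograms(cases, group_keys, value_key, edges):
    """Histogram counts of value_key, one histogram per combination of
    the grouping variables. Bins are [edges[b], edges[b+1]), except that
    the last bin also includes its upper edge."""
    n_bins = len(edges) - 1
    groups = defaultdict(lambda: [0] * n_bins)
    for case in cases:
        key = tuple(case[k] for k in group_keys)      # e.g. ("Dallas", "AM")
        value = case[value_key]
        for b in range(n_bins):
            is_last = b == n_bins - 1
            if edges[b] <= value < edges[b + 1] or (is_last and value == edges[-1]):
                groups[key][b] += 1
                break
    return dict(groups)


cases = [
    {"City": "Dallas", "Time": "AM", "Stress": 2},
    {"City": "Dallas", "Time": "PM", "Stress": 3},
    {"City": "Chicago", "Time": "AM", "Stress": 8},
    {"City": "Chicago", "Time": "PM", "Stress": 4},
]
print(categorized_histograms(cases, ["City"], "Stress", [0, 5, 10]))          # one-way
print(categorized_histograms(cases, ["City", "Time"], "Stress", [0, 5, 10]))  # two-way
```

Each key of the result corresponds to one component graph in the categorized display.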

Categorized graphs vs. matrix graphs: Matrix
graphs also produce displays containing multiple component
graphs; however, each of those component graphs is (or
can be) based on the same set of cases and the graphs are 
generated for all combinations of variables from one or two 
lists. Categorized graphs require a selection of variables 
that normally would be selected for non-categorized graphs 
of the respective type (e.g., two variables for a scatterplot). 
However, in categorized plots, you also need to specify at 
least one grouping variable (or some criteria to be used for 
sorting the observations into the categories) that contains 
information on group membership of each case (e.g., Chicago, 
Dallas). That grouping variable will not be included in the 
graph directly (i.e., it will not be plotted) but it will serve as 
a criterion for dividing all analyzed cases into separate 
graphs. As illustrated above, one graph will be created for 
each group (category) identified by the grouping variable. 

Common vs. Independent scaling: Each individual 
category graph may be scaled according to its own range of 
values (independent scaling), or all graphs may be scaled to
a common scale wide enough to accommodate all values in 
all of the category graphs. 

Common scaling allows the analyst to make comparisons 
of ranges and distributions of values among categories. 
However, if the ranges of values in graph categories are 
considerably different (causing a very wide common scale), 
then some of the graphs may be difficult to examine. The 
use of independent scaling may make it easier to spot trends 
and specific patterns within categories, but it may be more 
difficult to make comparisons of ranges of values among 
categories. 

Categorization Methods

There are five general methods of categorization of values 
and they will be reviewed briefly in this section: Integer 
mode, Categories, Boundaries, Codes, and Multiple subsets. 
Note that the same methods of categorization can be used 
to categorize cases into component graphs and to categorize 
cases within component graphs. 

Integer Mode: When you use Integer Mode, integer 
values of the selected grouping variable will be used to 
define the categories, and one graph will be created for all
cases that belong to each category (defined by those integer
values). If the selected grouping variable contains
non-integer values, the software will usually truncate each
encountered value of the selected grouping variable to an
integer value.



Categories: With this mode of categorization, you will 
specify the number of categories that you wish to use.
The software will divide the entire range of values of the 
selected grouping variable (from minimum to maximum) 
into the requested number of equal length intervals. 

Boundaries: The Boundaries method will also create
interval categorization; however, the intervals can be of
arbitrary (e.g., uneven) width as defined by custom interval
boundaries (for example, “less than -10,” “greater than or
equal to -10 but less than 0,” “greater than or equal to 0
but less than 10,” and “equal to or greater than 10”).



Codes: Use this method if the selected grouping variable 
contains “codes” (i.e., specific, meaningful values such as
Male, Female) from which you want to specify the categories. 

Multiple subsets: This method allows you to
custom-define the categories and enables you to use more than one
variable to define the category. In other words,
categorizations based on multiple subset definitions of 
categories may not represent distributions of specific 
(individual) variables but distributions of frequencies of 
specific “events” defined by particular combinations of values 
of several variables (and defined by conditions which may 
involve any number of variables from the current data set). 
For example, you might specify six categories based on 
combinations of three variables Gender, Age, and
Employment.
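
The first three categorization methods can be sketched as small Python functions. These are illustrative only and do not reproduce the behavior of any specific software; the function names are hypothetical, and the boundary example mirrors the one given above for the Boundaries method.

```python
import math
from bisect import bisect_right


def integer_mode(value):
    """Integer Mode: truncate a non-integer value to an integer category."""
    return math.trunc(value)


def equal_intervals(value, lowest, highest, n_categories):
    """Categories: index of the equal-length interval containing value."""
    if value == highest:                       # the maximum goes into the last interval
        return n_categories - 1
    return int((value - lowest) / (highest - lowest) * n_categories)


def by_boundaries(value, boundaries):
    """Boundaries: index of the custom interval, with each interval
    closed on the left (greater than or equal to its lower boundary)."""
    return bisect_right(boundaries, value)


print(integer_mode(3.9), integer_mode(-2.7))
print(equal_intervals(5.0, 0.0, 10.0, 5))
print([by_boundaries(v, [-10, 0, 10]) for v in (-11, -10, -0.5, 0, 9.9, 10)])
```

With the boundaries -10, 0, and 10, the four resulting categories are numbered 0 through 3, from “less than -10” up to “equal to or greater than 10.”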



Histograms 

In general, histograms are used to examine frequency 
distributions of values of variables. For example, the 
frequency distribution plot shows which specific values or 
ranges of values of the examined variable are most frequent, 
how differentiated the values are, whether most observations 
are concentrated around the mean, whether the distribution 
is symmetrical or skewed, whether it is multimodal (i.e., 
has two or more peaks) or unimodal, etc. Histograms are 
also useful for evaluating the similarity of an observed 
distribution with theoretical or expected distributions. 

Categorized Histograms allow you to produce histograms
broken down by one or more categorical variables, or by 
any other one or more sets of logical categorization rules. 

There are two major reasons why frequency distributions 
are of interest. One may learn from the shape of the 
distribution about the nature of the examined variable (e.g., 
a bimodal distribution may suggest that the sample is not 
homogeneous and consists of observations that belong to 
two populations that are more or less normally distributed). 
Many statistics are based on assumptions about the 
distributions of analyzed variables; histograms help one to 
test whether those assumptions are met. Often, the first
step in the analysis of a new data set is to run histograms
on all variables.

Histograms vs. Breakdown: Categorized Histograms 
provide information similar to breakdowns. Although specific 
(numerical) descriptive statistics are easier to read in a 
table, the overall shape and global descriptive characteristics 
of a distribution are much easier to examine in a graph. 
Moreover, the graph provides qualitative information about 
the distribution that cannot be fully represented by any 
single index. For example, the overall skewed distribution 
of income may indicate that the majority of people have an 
income that is much closer to the minimum than maximum 
of the range of income. Moreover, when broken down by 
gender and ethnic background, this characteristic of the 
income distribution may be found to be more pronounced in 
certain subgroups. Although this information will be 
contained in the index of skewness (for each sub-group), 
when presented in the graphical form of a histogram, the 
information is usually more easily recognized and 
remembered. The histogram may also reveal “bumps” that
may represent important facts about the specific social 
stratification of the investigated population or anomalies in 
the distribution of income in a particular group caused by a 
recent tax reform.

Categorized histograms and scatterplots: A useful
application of the categorization methods for continuous 
variables is to represent the simultaneous relationships 
between three variables. Shown below is a scatterplot for 
two variables Load 1 and Load 2. 



Now suppose you would like to add a third variable 
(Output) and examine how it is distributed at different levels 
of the joint distribution of Load 1 and Load 2. The following 
graph could be produced:

In this graph, Load 1 and Load 2 are both categorized
into 5 intervals, and within each combination of intervals 
the distribution for variable Output is computed. Note that 
the box (parallelogram) encloses approximately the same 
observations (cases) in both graphs shown above. 

Scatterplots 

In general, two-dimensional scatterplots are used to 
visualize relations between two variables X and Y (e.g., 
weight and height). In scatterplots, individual data points
are represented by point markers in two-dimensional space,
where the axes represent the variables. The two coordinates
(X and Y), which determine the location of each point,
correspond to its specific values on the two variables. If the
two variables are strongly related, then the data points
form a systematic shape (e.g., a straight line or a clear
curve). If the variables are not related, then the points form
a round “cloud.” 

The categorized scatterplot option allows you to produce 
scatterplots categorized by one or more variables. Via the 
Multiple Subsets method, you can also categorize the 
scatterplot based on logical selection conditions that define 
each category or group of observations. 

Categorized scatterplots offer a powerful exploratory and 
analytic technique for investigating relationships between 
two or more variables within different sub-groups. 

Homogeneity of Bivariate Distributions (Shapes 
of Relations): Scatterplots are typically used to identify 
the nature of relations between two variables (e.g., blood 
pressure and cholesterol level), because they can provide 
much more information than a correlation coefficient. 

For example, a lack of homogeneity in the sample from 
which a correlation was calculated can bias the value of the 
correlation. Imagine a case where a correlation coefficient 
is calculated from data points which came from two different
experimental groups, but this fact was ignored when the
correlation was calculated. Suppose the experimental
manipulation in one of the groups increased the values of
both correlated variables, and thus the data from each group
form a distinctive “cloud” in the scatterplot (as shown in
the following illustration).

In this example, the high correlation is entirely due to
the arrangement of the two groups, and it does not represent
the “true” relation between the two variables, which is
practically equal to 0 (as could be seen if one looked at each
group separately).

If you suspect that such a pattern may exist in your data
and you know how to identify the possible “subsets” of data,
then producing a categorized scatterplot may yield a more
accurate picture of the strength of the relationship between
the X and Y variables within each group (i.e., after controlling
for group membership).
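
The effect of pooling two heterogeneous groups can be reproduced with a few made-up numbers. The Python sketch below is only an illustration: the data points and the helper names pearson_r and r_of are assumptions, chosen so that X and Y are unrelated within each group while one group is shifted upward on both variables.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def r_of(points):
    """Correlation for a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return pearson_r(xs, ys)


# Hypothetical data: no relation inside either group,
# but group B lies higher than group A on both variables.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(x + 10, y + 10) for x, y in group_a]

r_a = r_of(group_a)
r_b = r_of(group_b)
r_pooled = r_of(group_a + group_b)
print(f"group A: {r_a:.2f}  group B: {r_b:.2f}  pooled: {r_pooled:.2f}")
```

The pooled coefficient comes out close to 1 even though the coefficient within each group is 0, which is why the correlation should be examined separately for each subset.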

Curvilinear Relations: Curvilinearity is another aspect
of the relationships between variables which can be
examined in scatterplots. There are no “automatic” or
easy-to-use tests to measure curvilinear relationships between
variables: the standard Pearson r coefficient measures only
linear relations; some nonparametric correlations such as
the Spearman R can measure curvilinear relations as long as
they are monotonic, but not non-monotonic relations.
Examining scatterplots allows one to identify the shape of
relations, so that later one can choose an appropriate data
transformation to “straighten” the data or an appropriate
nonlinear estimation equation to fit.
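
The difference between the two coefficients can be checked on small artificial data sets. In the Python sketch below, the helper names and the two example curves (an exponential and a parabola) are assumptions made for illustration; Spearman R is computed here simply as the Pearson correlation of the ranks.

```python
from math import exp, sqrt


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman R computed as the Pearson correlation of the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


x_pos = [0, 1, 2, 3, 4, 5]
y_exp = [exp(x) for x in x_pos]   # curved but monotonic
x_sym = [-3, -2, -1, 0, 1, 2, 3]
y_sq = [x * x for x in x_sym]     # curved and non-monotonic

print("exponential:", pearson_r(x_pos, y_exp), spearman_r(x_pos, y_exp))
print("parabola:   ", pearson_r(x_sym, y_sq), spearman_r(x_sym, y_sq))
```

For the exponential curve the Spearman coefficient reaches 1 while the Pearson coefficient stays clearly below it; for the parabola both coefficients are 0, although the relation is perfectly systematic. Only a scatterplot reveals the second pattern.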

Probability Plots 

Three types of categorized probability plots are Normal, 
Half Normal, and Detrended. Normal probability plots 
provide a quick way to visually inspect to what extent the 
pattern of data follows a normal distribution.

Via categorized probability plots, one can examine how
closely the distribution of a variable follows the normal 
distribution in different sub-groups. 
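
The coordinates of a normal probability plot can be computed by hand. The sketch below is one possible construction, not the procedure of any particular package: it pairs each sorted observation with the standard normal quantile expected for its rank, using the plotting position (i - 0.5) / n, which is an assumption (several other conventions are in use). The function name and the sample are hypothetical.

```python
from statistics import NormalDist


def normal_probability_points(values):
    """Pair each sorted value with the normal quantile expected for its rank."""
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # (i - 0.5) / n is one common choice of plotting position.
    expected = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, sorted(values)))


sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.5]
points = normal_probability_points(sample)
for z, v in points:
    print(f"{z:6.2f}  {v:4.1f}")
```

If the points fall close to a straight line when plotted, the sample is roughly normal. For a categorized plot, the same function would be applied to each sub-group separately.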


Categorized normal probability plots provide an efficient 
tool to examine the normality aspect of group homogeneity.

Quantile-Quantile Plots

The categorized Quantile-Quantile (or Q-Q) plot is useful 
for finding the best fitting distribution within a family of 
distributions.

With Categorized Q-Q plots, a series of Quantile-Quantile
(or Q-Q) plots, one for each category of cases identified by
the X or X and Y category variables, is produced.

Probability-Probability Plots 

The categorized Probability-Probability (or P-P) plot is 
useful for determining how well a specific theoretical 
distribution fits the observed data. This type of graph 
includes a series of Probability-Probability (or P-P) plots, 
one for each category of cases identified by the X or X and Y 
category variables. 



In the P-P plot, the observed cumulative distribution 
function (the proportion of non-missing values ≤ x) is plotted
against a theoretical cumulative distribution function in 
order to assess the fit of the theoretical distribution to the 
observed data. If all points in this plot fall onto a diagonal 
line (with intercept 0 and slope 1), then you can conclude 
that the theoretical cumulative distribution adequately 
approximates the observed distribution. If the data points 
do not all fall on the diagonal line, then you can use this 
plot to visually assess where the data do and do not follow 
the distribution (e.g., if the points form an S shape along
the diagonal line, then the data may need to be transformed
in order to bring them to the desired distribution pattern).
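
The construction of a P-P plot can be sketched in a few lines of Python. The example below assumes that the theoretical distribution is a normal distribution fitted by the sample mean and standard deviation; the function names pp_points and max_deviation and the sample values are hypothetical, and the observed proportions assume there are no tied values.

```python
from statistics import NormalDist, mean, stdev


def pp_points(values):
    """Return (theoretical, observed) cumulative proportions for a fitted normal."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    points = []
    for rank, v in enumerate(sorted(values), start=1):
        observed = rank / n          # proportion of values <= v (no ties assumed)
        theoretical = fitted.cdf(v)  # same proportion under the fitted normal
        points.append((theoretical, observed))
    return points


def max_deviation(points):
    """Largest vertical distance of any point from the diagonal."""
    return max(abs(obs - theo) for theo, obs in points)


sample = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.5]
pp = pp_points(sample)
print(f"largest distance from the diagonal: {max_deviation(pp):.3f}")
```

A small maximum distance indicates that the points stay close to the diagonal; the plot itself shows where along the distribution any departures occur.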


Line Plots

In line plots, a line connects individual data points.
Line plots provide a simple way to visually present a
sequence of many values (e.g., stock market quotes over a
number of days). The categorized Line Plots graph is useful
when one wants to view such data broken down (categorized)
by a grouping variable (e.g., closing stock quotes on Mondays,
Tuesdays, etc.) or some other logical criteria involving one
or more other variables (e.g., closing quotes only for those
days when two other stocks and the Dow Jones index went
up, versus all other closing quotes).


Box Plots 

In Box Plots (the term first used by Tukey, 1970), ranges 
of values of a selected variable (or variables) are plotted 
separately for groups of cases defined by values of up to 
three categorical (grouping) variables, or as defined by 
Multiple Subsets categories. 



The central tendency (e.g., median or mean) and range
or variation statistics are computed for each group of cases,
and the selected values are presented in one of five styles
(Box Whiskers, Whiskers, Boxes, Columns, or High-Low
Close). Outlier data points can also be plotted (see the
sections on outliers and extremes).

For example, in the following graph, outliers (in this
case, points lying more than 1.5 times the inter-quartile
range above or below the box) indicate a particularly
“unfortunate” flaw in an otherwise nearly perfect
combination of factors:



However, in the following graph, no outliers or extreme 
values are evident.

There are two typical applications for box plots: (a)
showing ranges of values for individual items, cases or
samples (e.g., a typical MIN-MAX plot for stocks or
commodities or aggregated sequence data plots with ranges),
and (b) showing variation of scores in individual groups or
samples (e.g., box and whisker plots presenting the mean
for each sample as a point inside the box, standard errors
as the box, and standard deviations around the mean as a
narrower box or a pair of whiskers).

Box plots showing variation of scores allow one to quickly 
evaluate and “intuitively envision” the strength of the 
relation between the grouping and dependent variable. 
Specifically, assuming that the dependent variable is 
normally distributed, and knowing what proportion of 
observations fall, for example, within ±1 or ±2 standard 
deviations from the mean, one can easily evaluate the results 
of an experiment and say that, for example, the scores in 
about 95% of cases in experimental group 1 belong to a 
different range than scores in about 95% of cases in group 
2. In addition, so-called trimmed means (this term was first 
used by Tukey, 1962) may be plotted by excluding a
user-specified percentage of cases from the extremes (i.e., tails)
of the distribution of cases. 
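
The statistics behind a median-based box plot and a trimmed mean can be sketched as follows. This is an illustration under stated assumptions: quartiles are computed with the "inclusive" method of Python's statistics.quantiles (other quartile conventions give slightly different boxes), the outlier coefficient is 1.5, and the percentage passed to trimmed_mean is removed from each tail. The function names and scores are hypothetical.

```python
from statistics import mean, quantiles


def box_summary(values, k=1.5):
    """Quartiles plus the values lying more than k * IQR outside the box."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}


def trimmed_mean(values, percent):
    """Mean after dropping `percent` percent of the cases from each tail."""
    ordered = sorted(values)
    cut = int(len(ordered) * percent / 100)
    return mean(ordered[cut:len(ordered) - cut])


scores = [12, 13, 13, 14, 15, 15, 16, 17, 18, 40]
summary = box_summary(scores)
print(summary)
print("mean:", mean(scores), "10% trimmed mean:", trimmed_mean(scores, 10))
```

The single extreme score is flagged as an outlier and pulls the ordinary mean upward, while the trimmed mean stays near the median.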

Pie Charts 

The pie chart is one of the most common graph formats 
used for representing proportions or values of variables. 
This graph allows you to produce pie charts broken down 
by one or more other variables (e.g., grouping variables such
as gender) or categorized according to some logical selection 
conditions that identify Multiple Subsets. 

For purposes of this discussion, categorized pie charts
will always be interpreted as frequency pie charts (as
opposed to data pie charts). This type of pie chart
interprets data like a histogram: it categorizes all values
of the selected variable following the selected
categorization technique and then displays the relative
frequencies as pie slices of proportional sizes. Thus,
these pie charts offer an alternative method to display
frequency histogram data.

Pie-Scatterplots: Another useful application of
categorized pie charts is to represent the relative frequency 
distribution of a variable at each “location” of the joint 
distribution of two other variables. Here is an example: 







Note that pies are only drawn in “places” where there 
are data. Thus, the graph shown above takes on the 
appearance of a scatterplot (of variables L1 and L2), with
the individual pies as point markers. However, in addition 
to the information contained in a simple scatterplot, each 
pie shows the relative distribution of a third variable at the 
respective location (i.e., Low, Medium, and High Quality). 

Missing/Range Data Points Plots 

This graph produces a series of 2D graphs (one for each
category of cases identified by the grouping variables or by
the Multiple Subset criteria) of missing data points and/or
user-specified “out of range” points, from which you can
visualize the pattern or distribution of missing data (and/or
user-specified “out of range” points) within each subset of
cases (category).



This graph is useful in exploratory data analysis to 
determine the extent of missing (and/or “out of range”) data 
and whether the patterns of those data occur randomly.

3D Plots

This type of graph allows you to produce 3D scatterplots, 
contour plots, and surface plots for subsets of cases defined 
by the specified categories of a selected variable or categories 
determined by user-defined case selection conditions. Thus, 
the general purpose of this plot is to facilitate comparisons 
between groups or categories regarding the relationships 
between three or more variables.

Applications: In general, 3D XYZ graphs summarize
the interactive relationships between three variables. The 
different ways in which data can be categorized (in a 
Categorized Graph) allow one to review those relationships 
contingent on some other criterion (e.g., group membership). 

For example, from the categorized surface plot shown 
below, one can conclude that the setting of the tolerance 
level in an apparatus does not affect the investigated 
relationship between the measurements (Depend1, Depend2,
and Height) unless the setting is < 3.

The effect is more salient when you switch to the contour
plot representation.

Ternary Plots

A categorized ternary plot can be used to examine
relations between three or more dimensions where three of
those dimensions represent components of a mixture (i.e.,
the relations between them are constrained such that the
values of the three variables add up to the same constant
for each case) for each level of a grouping variable.



In ternary plots, the triangular coordinate systems are 
used to plot four (or more) variables (the components X, Y, 
and Z, and the responses V1, V2, etc.) in two dimensions
(ternary scatterplots or contours) or three dimensions 
(ternary surface plots). In order to produce ternary graphs 
the relative proportions of each component within each case 
are constrained to add up to the same value (e.g., 1). 
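
One way to place a mixture in the triangular coordinate system is shown in the Python sketch below. The choice of vertex positions (pure X at the lower left, pure Y at the lower right, pure Z at the top) and the function name ternary_to_xy are assumptions made for this example; plotting packages may orient the triangle differently.

```python
from math import sqrt


def ternary_to_xy(x, y, z):
    """Map mixture components (x, y, z) to a point in an equilateral triangle."""
    total = x + y + z
    if total <= 0:
        raise ValueError("components must have a positive sum")
    # Rescale so that the three proportions add up to 1.
    py, pz = y / total, z / total
    # Vertices: pure X at (0, 0), pure Y at (1, 0), pure Z at (0.5, sqrt(3)/2).
    return (py + 0.5 * pz, sqrt(3) / 2 * pz)


print(ternary_to_xy(1, 0, 0))   # pure X
print(ternary_to_xy(0, 1, 0))   # pure Y
print(ternary_to_xy(1, 1, 1))   # equal parts of all three components
```

Because the function rescales by the total, it accepts components that sum to any positive constant (e.g., 1 or 100), in line with the constraint described above.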

In a categorized ternary plot, one component graph is 
produced for each level of the grouping variable (or user- 
defined subset of data) and all the component graphs are 
arranged in one display to allow for comparisons between 
the subsets of data (categories). 

Applications: A typical application of this graph is when 
the measured response(s) from an experiment depends on 
the relative proportions of three components (e.g., three 
different chemicals) which are varied in order to determine 
an optimal combination of those components (e.g., in mixture 
designs). This type of graph can also be used for other
applications where relations between constrained variables
need to be compared across categories or subsets of data. 



Brushing 

Perhaps the most common and historically first widely
used technique explicitly identified as graphical exploratory
data analysis is brushing, an interactive method allowing
one to select on-screen specific data points or subsets of data
and identify their (e.g., common) characteristics, or to examine
their effects on relations between relevant variables (e.g., in
scatterplot matrices) or to identify (e.g., label) outliers.

Those relations between variables can be visualized by
fitted functions (e.g., 2D lines or 3D surfaces) and their
confidence intervals; thus, for example, one can examine
changes in those functions by interactively (temporarily)
removing or adding specific subsets of data. For example,
one of many applications of the brushing technique is to
select (i.e., highlight) in a matrix scatterplot all data points
that belong to a certain category (e.g., a “medium” income
level; see the highlighted subset in the upper right
component graph in the illustration below) in order to
examine how those specific observations contribute to
relations between other variables in the same data set
(e.g., the correlation between “debt” and “assets” in the
current example).

If the brushing facility supports features like “animated
brushing” (see example below) or “automatic function
re-fitting,” one can define a dynamic brush that would move
over the consecutive ranges of a criterion variable (e.g.,
“income” measured on a continuous scale and not a discrete
scale as in the illustration above) and examine the
dynamics of the contribution of the criterion variable to
the relations between other relevant variables in the same
data set.

Smoothing Bivariate Distributions

Three-dimensional histograms are used to visualize 
crosstabulations of values in two variables. They can be 
considered to be a conjunction of two simple (i.e., univariate) 
histograms, combined such that the frequencies of
co-occurrences of values on the two analyzed variables can be
examined. In the most common format of this graph, a 3D bar
is drawn for each “cell” of the crosstabulation table and the 
height of the bar represents the frequency of values for the 
respective cell of the table. Different methods of 
categorization can be used for each of the two variables for 
which the bivariate distribution is visualized. 
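
The table of counts that a 3D histogram displays can be built with a simple tally. In the Python sketch below, the function name crosstab and the six example cases are hypothetical; each count corresponds to the height of one bar.

```python
from collections import Counter


def crosstab(pairs):
    """Count co-occurrences of categories on two variables."""
    return Counter(pairs)


# Hypothetical categorized observations: (gender, employment status).
cases = [("F", "employed"), ("M", "employed"), ("F", "student"),
         ("F", "employed"), ("M", "retired"), ("M", "employed")]
table = crosstab(cases)

rows = sorted({r for r, _ in table})
cols = sorted({c for _, c in table})
print("     ", cols)
for r in rows:
    # Each count would be the height of one bar in the 3D histogram.
    print(r, [table[(r, c)] for c in cols])
```

Cells that never occur in the data simply have a count of 0. For continuous variables, each value would first be assigned to an interval, as in an ordinary histogram.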



If the software provides smoothing facilities, you can fit 
surfaces to 3D representations of bivariate frequency data. 
Thus, every 3D histogram can be turned into a smoothed 
surface. This technique is of relatively little help if applied 
to a simple pattern of categorized data (such as the 
histogram that was shown above). 

However, if applied to more complex patterns of 
frequencies, it may provide a valuable exploratory technique, 
allowing identification of regularities which are less salient
when examining the standard 3D histogram representations
(e.g., see the systematic surface “wave-patterns” shown on
the smoothed histogram above).

Layered Compression

When layered compression is used, the main graph 
plotting area is reduced in size to leave space for Margin 
Graphs on the upper and right sides of the display (and a
miniature graph in the corner). These smaller Margin
Graphs represent vertically and horizontally compressed 
images (respectively) of the main graph. 

In 2D graphs, layered compression is an exploratory 
data analysis technique that may facilitate the identification 
of otherwise obscured trends and patterns in 2-dimensional 
data sets. For example, in the following illustration
(based on an example discussed by Cleveland, 1993), it can
be seen that the number of sunspots in each cycle decays 
more slowly than it rises at the onset of each cycle. This 
tendency is not readily apparent when examining the 
standard line plot; however, the compressed graph uncovers 
the hidden pattern. 

Projections of 3D data sets 

Contour plots generated by projecting surfaces (created 
from multivariate, typically three-variable, data sets) offer 
a useful method to explore and analytically examine the 
shapes of surfaces.

Compared to surface plots, they may be less effective
for quickly visualizing the overall shape of 3D data structures;
however, their main advantage is that they allow for precise
examination and analysis of the shape of the surface
(contour plots display a series of undistorted horizontal
“cross sections” of the surface).

Icon Plots

Icon Graphs represent cases or units of observation as 
multidimensional symbols and they offer a powerful although 
not easy to use exploratory technique. The general idea 
behind this method capitalizes on the human ability to 
“automatically” spot complex (sometimes interactive) 
relations between multiple variables if those relations are 
consistent across a set of instances (in this case, icons).
Sometimes the observation (or a “feeling”) that certain
instances are “somehow similar” to each other comes before
the observer (in this case, an analyst) can articulate which
specific variables are responsible for the observed
consistency. However, further analysis that focuses on such
intuitively spotted consistencies can reveal the specific
nature of the relevant relations between variables.

The basic idea of icon plots is to represent individual
units of observation as particular graphical objects where 
values of variables are assigned to specific features or 
dimensions of the objects (usually one case = one object). 
The assignment is such that the overall appearance of the 
object changes as a function of the configuration of values.
Thus, the objects are given visual “identities” that are
unique for configurations of values and that can be identified 
by the observer. Examining such icons may help to discover 
specific clusters of both simple relations and interactions 
between variables. 


Analyzing Icon Plots 

The “ideal” design of the analysis of icon plots consists
of five phases:

(1) Select the order of variables to be analyzed. In many
cases a random starting sequence is the best solution. You
may also try to enter variables based on the order in a
multiple regression equation, factor loadings on an
interpretable factor, or a similar multivariate technique.
That method may simplify and “homogenize” the general
appearance of the icons, which may facilitate the
identification of non-salient patterns. It may also, however,
make some interactive patterns more difficult to find. No
universal recommendations can be given at this point, other
than to try the quicker (random order) method before getting
involved in the more time-consuming method.

(2) Look for any potential regularities, such as similarities
between groups of icons, outliers, or specific relations
between aspects of icons (e.g., “if the first two rays of the
star icon are long, then one or two rays on the other side of
the icon are usually short”). The Circular type of icon plots
is recommended for this phase.

(3) If any regularities are found, try to identify them in
terms of the specific variables involved.

(4) Reassign variables to features of icons (or switch to one
of the sequential icon plots) to verify the identified structure
of relations (e.g., try to move the related aspects of the icon
closer together to facilitate further comparisons). In some
cases, at the end of this phase it is recommended to drop
the variables that appear not to contribute to the identified
pattern.

(5) Finally, use a quantitative method to test and quantify
the identified pattern or at least some aspects of the pattern.


Taxonomy of Icon Plots 

Most icon plots can be assigned to one of two categories: 
circular and sequential. 


Circular icons: Circular icon plots (star plots, sun ray 
plots, polygon icons) follow a “spoked wheel” format where 
values of variables are represented by distances between 
the center (“hub”) of the icon and its edges.

Those icons may help to identify interactive relations
between variables because the overall shape of the icon 
may assume distinctive and identifiable overall patterns 
depending on multivariate configurations of values of input 
variables. In order to translate such “overall patterns” into 
specific models (in terms of relations between variables) or 
verify specific observations about the pattern, it is helpful 
to switch to one of the sequential icon plots, which may 
prove more efficient when one already knows what to 
look for. 


Sequential icons: Sequential icon plots (column icons, 
profile icons, line icons) follow a simpler format where 
individual symbols are represented by small sequence plots 
(of different types).

The values of consecutive variables are represented in
those plots by distances between the base of the icon and 
the consecutive break points of the sequence (e.g., the height 
of the columns shown above). Those plots may be less 
efficient as a tool for the initial exploratory phase of icon 
analysis because the icons may look alike. However, as 
mentioned before, they may be helpful in the phase when 
some hypothetical pattern has already been revealed and
one needs to verify it or articulate it in terms of relations
between individual variables. 


Pie icons: Pie icon plots fall somewhere in between the
previous two categories; all icons have the same shape (pie)
but are sequentially divided in a different way according to
the values of consecutive variables. From a functional point
of view, they belong to the sequential rather than the circular
category, although they can be used for both types of
applications.



Chernoff faces: This type of icon is a category by itself.
Schematic faces visualize cases such that relative values of 
variables selected for the graph are represented by variations 
of specific facial features. 





Due to its unique features, some researchers consider it
an ultimate exploratory multivariate technique that is
capable of revealing hidden patterns of interrelations
between variables that cannot be uncovered by any other
technique. This statement may be an exaggeration, however.
Also, it must be admitted that Chernoff Faces is a method
that is difficult to use, and it requires a great deal of
experimentation with the assignment of variables to facial
features.

Standardization of Values 

Except for unusual cases when you intend for the icons 
to reflect the global differences in ranges of values between 
the selected variables, the values of the variables should be 
standardized once to assure within-icon compatibility of 
value ranges. For example, because the largest value sets
the global scaling reference point for the icons, variables
whose values are in a range of much smaller order
may not appear in the icon at all; e.g., in a star plot,
the rays that represent them will be too short to be visible.
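
A minimal sketch of such a rescaling is given below. The text does not say which standardization is meant, so the choice made here (mapping each variable linearly onto the range 0 to 1 across cases) is an assumption; z-scores or other transformations could serve the same purpose. The function name and the car data are hypothetical.

```python
def rescale_columns(rows):
    """Rescale every variable (column) to the 0-1 range across cases (rows)."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant variable carries no information; map it to 0.5.
        scaled_columns.append([(v - lo) / span if span else 0.5 for v in col])
    return [list(case) for case in zip(*scaled_columns)]


# Hypothetical cases measured on very different scales:
# (price in dollars, acceleration time in seconds).
raw = [(12000, 9.5), (30000, 6.0), (21000, 7.75)]
scaled = rescale_columns(raw)
for before, after in zip(raw, scaled):
    print(before, "->", after)
```

After rescaling, both variables span the same range, so neither ray of a star icon would shrink to an invisible length merely because of its unit of measurement.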

Applications 

Icon plots are generally applicable (1) to situations where 
one wants to find systematic patterns or clusters of 
observations, and (2) when one wants to explore possible 
complex relationships between several variables. The first 
type of application is similar to cluster analysis; that is, it 
can be used to classify observations. 

For example, suppose you studied the personalities of 
artists, and you recorded the scores for several artists on a 
number of personality questionnaires. The icon plot may 
help you determine whether there are natural clusters of 
artists distinguished by particular patterns of scores on 
different questionnaires (e.g., you may find that some artists 
are very creative, undisciplined, and independent, while a 
second group is particularly intelligent, disciplined, and 
concerned with publicly-acknowledged success).

The second type of application, the exploration of
relationships between several variables, is more similar to
factor analysis; that is, it can be used to detect which
variables tend to “go together.” For example, suppose you
were studying the structure of people’s perception of cars.
Several subjects completed detailed questionnaires rating
different cars on numerous dimensions. In the data file, the 
average ratings on each dimension (entered as the variables) 
for each car (entered as cases or observations) are recorded. 


When you now study the Chernoff faces (each face 
representing the perceptions for one car), it may occur to 
you that smiling faces tend to have big ears; if price was 
assigned to the amount of smile and acceleration to the size 
of ears, then this “discovery” means that fast cars are more 
expensive. This, of course, is only a simple example; in real- 
life exploratory data analyses, non-obvious complex 
relationships between variables may become apparent. 


Related Graphs 

Matrix plots visualize relations between variables from 
one or two lists. If the software allows you to mark selected 
subsets, matrix plots may provide information similar to 
that in icon plots. 

If the software allows you to create and identify user- 
defined subsets in scatterplots, simple 2D scatterplots can 
be used to explore the relationships between two variables; 
likewise, when exploring the relationships between three 
variables, 3D scatterplots provide an alternative to icon plots.

Graph Type 

There are various types of Icon Plots. 

Chernoff Faces: A separate “face” icon is drawn for 
each case; relative values of the selected variables for each 
case are assigned to shapes and sizes of individual facial 
features (e.g., length of nose, angle of eyebrows, width of
face). 




Stars: Star Icons is a circular type of icon plot. A 
separate star-like icon is plotted for each case; relative values 
of the selected variables for each case are represented 
(clockwise, starting at 12:00) by the length of individual
rays in each star. The ends of the rays are connected by a
line. 



Sun Rays: Sun Ray Icons is a circular type of icon plot. 
A separate sun-like icon is plotted for each case; each ray 
represents one of the selected variables (clockwise, starting 
at 12:00), and the length of the ray represents the relative 
value of the respective variable. Data values of the variables 
for each case are connected by a line. 




Polygons: Polygon Icons is a circular type of icon plot. 
A separate polygon icon is plotted for each case; relative
values of the selected variables for each case are represented
by the distance from the center of the icon to consecutive
corners of the polygon (clockwise, starting at 12:00).

Pies: Pie Icons is a circular type of icon plot. Data values
for each case are plotted as a pie chart (clockwise, starting 
at 12:00); relative values of selected variables are 
represented by the size of the pie slices.

Columns: Column Icons is a sequential type of icon
plot. An individual column graph is plotted for each case; 
relative values of the selected variables for each case are 
represented by the height of consecutive columns. 



Lines: Line Icons is a sequential type of icon plot.
An individual line graph is plotted for each case; relative
values of the selected variables for each case are represented
by the height of consecutive break points of the line above
the baseline.


Profiles: Profile Icons is a sequential type of icon plot.
An individual area graph is plotted for each case; relative
values of the selected variables for each case are represented
by the height of consecutive peaks of the profile above the
baseline.
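
To make the geometry of the circular icons concrete, the following Python sketch computes the ray end points of a star icon for one case, with the rays arranged clockwise starting at 12:00 as described above for Stars. The function name star_vertices is hypothetical, and the values are assumed to have been rescaled beforehand so that ray lengths are comparable.

```python
from math import cos, pi, sin


def star_vertices(values):
    """Ray end points for one case, clockwise from the 12:00 position."""
    n = len(values)
    points = []
    for k, v in enumerate(values):
        angle = 2 * pi * k / n  # measured clockwise from "straight up"
        points.append((v * sin(angle), v * cos(angle)))
    return points


# One hypothetical case with four rescaled variables.
case = [1.0, 0.5, 1.0, 0.5]
vertices = star_vertices(case)
for (x, y), v in zip(vertices, case):
    print(f"value {v:.1f} -> ({x:5.2f}, {y:5.2f})")
```

Connecting consecutive end points with a line closes the star. The same end points, connected without drawing the rays, would give the outline of a polygon icon.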



Mark Icons 


If the software allows you to specify multiple subsets, it 
is useful to specify the cases (subjects) whose icons will be 
marked (i.e., frames will be placed around the selected icons) 
in the plot. 

The line patterns of frames, which identify specific 
subsets, should be listed in the legend along with the case 
selection conditions. The following graph shows an example 
of marked subsets. 



All cases (observations) which meet the condition 
specified in Subset 1 (i.e., cases for which the value of 
variable Iristype is equal to Setosa and for which the case 
number is less than 100) are marked with a specific frame 
around the selected icons. 


All cases which meet the condition outlined in Subset 2 
(i.e., cases for which the value of Iristype is equal to Virginic 
and for which the case number is less than 100) are assigned 
a different frame around the selected icons. 


Data Reduction 

Sometimes plotting an extremely large data set can 
obscure an existing pattern. When 
you have a very large data file, it can be useful to plot only 
a subset of the data, so that the pattern is not hidden by 
the number of point markers. 

Some software products offer methods for data reduction 
(or optimizing), which can be useful in these instances. 
Ideally, a data reduction option will allow you to specify an 
integer value n less than the number of cases in the data 
file. Then the software will randomly select approximately 
n cases from the available cases and create the plot based 
on these cases only. 

Note that such data set (or sample size) reduction 
methods effectively draw a random sample from the current 
data set. Obviously, the nature of such data reduction is 
entirely different from selectively reducing the data 
to a specific subset or splitting them into subgroups based on 
certain criteria (e.g., gender, region, or cholesterol 
level). The latter methods can be implemented interactively 
or by other techniques. All these methods can further aid in 
identifying patterns in large data sets. 
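
The random-sampling form of data reduction described above can be sketched in a few lines of Python. This is only an illustration: the function name reduce_cases, the seed argument, and the simulated data are assumptions made for the example, and the sketch draws exactly n cases, whereas a software product may select only approximately n.

```python
import random

def reduce_cases(cases, n, seed=None):
    """Return a simple random sample of n cases, drawn without replacement."""
    if not 0 < n < len(cases):
        raise ValueError("n must be a positive integer below the number of cases")
    rng = random.Random(seed)      # a fixed seed makes the reduced plot reproducible
    return rng.sample(cases, n)

# Hypothetical data file: 10,000 (x, y) points
data_rng = random.Random(0)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(10_000)]

subset = reduce_cases(points, 500, seed=1)   # plot these 500 points only
print(len(points), "cases reduced to", len(subset))
```

Because every case has the same chance of being selected, the reduced plot keeps the overall shape of the point cloud, which is what distinguishes this approach from selecting cases by a criterion.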

Data Rotation (in 3D space) 

Changing the viewpoint for 3D scatterplots may prove 
to be an effective exploratory technique since it can reveal 
patterns that are easily obscured unless you look at the 
“cloud” of data points from an appropriate angle. 

Some software products offer interactive perspective, 
rotation, and continuous spinning controls, which can be 
useful in these instances. Ideally, these controls will allow 
you to adjust the graph's angle and perspective to find the 
most informative location of the “viewpoint” for the graph, 
and to control the vertical and horizontal 
rotation of the graph. 





Experimental Design 
(Industrial DOE) 


DOE OVERVIEW 

Experiments in Science and Industry 

Experimental methods are widely used in research as well 
as in industrial settings, although sometimes for very 
different purposes. The primary goal in scientific research 
is usually to show the statistical significance of an effect 
that a particular factor exerts on the dependent variable of 
interest. 

In industrial settings, the primary goal is usually to 
extract the maximum amount of unbiased information 
regarding the factors affecting a production process from as 
few (costly) observations as possible. While in the former 
application (in science) analysis of variance (ANOVA) 
techniques are used to uncover the interactive nature of 
reality, as manifested in higher-order interactions of factors, 
in industrial settings interaction effects are often regarded 
as a “nuisance” (they are often of no interest; they only 
complicate the process of identifying important factors). 

Differences in Techniques 


These differences in purpose have a profound effect on 
the techniques that are used in the two settings. If you 
review a standard ANOVA text for the sciences, for example 
the classic texts by Winer (1962) or Keppel (1982), you will 
find that they primarily discuss designs with up to, 
perhaps, five factors. The focus of these discussions is how 
to derive valid and robust statistical significance tests. 
However, if you review standard texts on experimentation 
in industry, you will find that they primarily discuss 
designs with many factors (e.g., 16 or 32) in which interaction 
effects cannot be evaluated, and the primary focus of the 
discussion is how to derive unbiased main effect (and, 
perhaps, two-way interaction) estimates with a minimum 
number of observations. 


This comparison could be expanded further; however, the 
more detailed description of experimental design in industry 
that follows will make other differences 
clear. Note that the General Linear Models and ANOVA/ 
MANOVA chapters contain detailed discussions of typical 
design issues in scientific research; the General Linear Model 
procedure is a very comprehensive implementation of the 
general linear model approach to ANOVA/MANOVA 
(univariate and multivariate ANOVA). There are of course 
applications in industry where general ANOVA designs, as 
used in scientific research, can be immensely useful. You 
may want to read the General Linear Models and ANOVA/ 
MANOVA chapters to gain a more general appreciation of 
the range of methods encompassed by the term Experimental 
Design. 


Overview 

The general ideas and principles on which 
experimentation in industry is based, and the types of 
designs used, will be discussed in the following paragraphs. 
These paragraphs are meant to be introductory in 
nature. However, it is assumed that you are familiar with 
the basic ideas of analysis of variance and the interpretation 
of main effects and interactions in ANOVA. 

General Ideas 

In general, every machine used in a production process 
allows its operators to adjust various settings, affecting the 
resultant quality of the product manufactured by the 
machine. Experimentation allows the production engineer 
to adjust the settings of the machine in a systematic manner 
and to learn which factors have the greatest impact on the 
resultant quality. Using this information, the settings can 
be constantly improved until optimum quality is obtained. 
To illustrate this reasoning, here are a few examples: 

Example 1: Dyestuff manufacture. Box and Draper 
(1987, page 115) report an experiment concerned with the 
manufacture of certain dyestuff. Quality in this context can 
be described in terms of a desired (specified) hue and 
brightness and maximum fabric strength. Moreover, it is 
important to know what to change in order to produce a 
different hue and brightness should the consumers’ taste 
change. Put another way, the experimenter would like to 
identify the factors that affect the brightness, hue, and 
strength of the final product. In the example described by 
Box and Draper, there are 6 different factors that are 
evaluated in a 2**(6-0) design (the 2**(k-p) notation is 
explained below). The results of the experiment show that 
the three most important factors determining fabric strength 
are the Polysulfide index, Time, and Temperature (Box and 
Draper, 1987, page 116). One can summarize the expected 
effect (predicted means) for the variable of interest (i.e., 
fabric strength in this case) in a so-called cube-plot. This 
plot shows the expected (predicted) mean fabric strength 
for the respective low and high settings for each of the 
three variables (factors). 

Example 1.1: Screening designs. In the previous 
example, 6 different factors were simultaneously evaluated. 
It is not uncommon that there are very many (e.g., 100) 
different factors that may potentially be important. Special 
designs have been developed to screen such large numbers 
of factors in an efficient manner, that is, with the least 
number of observations necessary. For example, you can 
design and analyze an experiment with 127 factors and 
only 128 runs (observations); still, you will be able to 
estimate the main effects for each factor, and thus, you can 
quickly identify which ones are important and most likely 
to yield improvements in the process under study. 
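
To make the 127-factors-in-128-runs claim concrete, the following Python sketch builds one such design. It uses the Sylvester construction of a Hadamard matrix, which is an assumption chosen for illustration and not necessarily the construction a given software product uses; the function names are hypothetical.

```python
def sylvester_hadamard(order):
    """Build a +1/-1 Hadamard matrix whose order is a power of 2."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of 2")
    h = [[1]]
    while len(h) < order:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def saturated_design(runs):
    """Drop the constant first column; the other runs-1 columns are factors."""
    return [row[1:] for row in sylvester_hadamard(runs)]

design = saturated_design(128)          # 128 runs (rows)
columns = list(zip(*design))            # 127 factor columns
print(len(design), "runs,", len(columns), "factors")
```

Every factor column is balanced (as many high as low settings) and orthogonal to every other column, which is why all 127 main effects can still be estimated separately.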

Example 2: 3**3 design. Montgomery describes an 
experiment conducted in order to identify the factors that 
contribute to the loss of soft drink syrup due to frothing 
during the filling of five-gallon metal containers. Three 
factors were considered: (a) the nozzle configuration, (b) 
the operator of the machine, and (c) the operating pressure. 
Each factor was set at three different levels, resulting in a 
complete 3**(3-0) experimental design (the 3**(k-p) notation 
is explained below). 

Moreover, two measurements were taken for each 
combination of factor settings, that is, the 3**(3-0) design 
was completely replicated once. 
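
The layout of this replicated design can be enumerated directly. In the sketch below the level labels are hypothetical placeholders, since the actual nozzle types, operators, and pressures are not given here.

```python
from itertools import product

# Hypothetical level labels; only the 3 x 3 x 3 structure is taken from the text.
nozzles = ["nozzle 1", "nozzle 2", "nozzle 3"]
operators = ["operator 1", "operator 2", "operator 3"]
pressures = ["low", "medium", "high"]
replicates = 2   # two measurements per combination of factor settings

runs = [(n, o, p, rep)
        for n, o, p in product(nozzles, operators, pressures)
        for rep in range(1, replicates + 1)]

print(len(runs))   # 3**3 combinations times 2 replicates
```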


Example 3: Maximizing yield of a chemical reaction. 
The yield of many chemical reactions is a function of time 
and temperature. Unfortunately, these two variables often 
do not affect the resultant yield in a linear fashion. In other 
words, it is not the case that “the longer the time, the greater the 
yield” and “the higher the temperature, the greater the 
yield.” Rather, both of these variables are usually related in 
a curvilinear fashion to the resultant yield. 

Thus, in this example your goal as experimenter would 
be to optimize the yield surface that is created by the two 
variables: time and temperature. 

Example 4: Testing the effectiveness of four fuel 
additives. Latin square designs are useful when the factors 
of interest are measured at more than two levels, and the 
nature of the problem suggests some blocking. For example, 
imagine a study of 4 fuel additives on the reduction in 
oxides of nitrogen. You may have 4 drivers and 4 cars at 
your disposal. You are not particularly interested in any 
effects of particular cars or drivers on the resultant oxide 
reduction; however, you do not want the results for the fuel 
additives to be biased by the particular driver or car. Latin 
square designs allow you to estimate the main effects of all 
factors in the design in an unbiased manner. With regard 
to the example, the arrangement of treatment levels in a 
Latin square design assures that the variability among 
drivers or cars does not affect the estimation of the effect 
due to different fuel additives. 
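
A Latin square for this example can be written down with a simple cyclic rule. The sketch below is illustrative only: the additive labels are hypothetical, and in practice the rows and columns of the square would also be randomized before the experiment is run.

```python
def latin_square(treatments):
    """Cyclic Latin square: each treatment occurs once per row and per column."""
    k = len(treatments)
    return [[treatments[(row + col) % k] for col in range(k)]
            for row in range(k)]

additives = ["A", "B", "C", "D"]          # hypothetical labels for 4 additives
square = latin_square(additives)          # rows = drivers, columns = cars

for driver, row in enumerate(square, start=1):
    print("driver", driver, row)
```

Because every driver and every car meets each additive exactly once, differences among drivers or cars cancel out of the additive comparisons.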

Example 5: Improving surface uniformity in the 
manufacture of polysilicon wafers. The manufacture of 
reliable microprocessors requires very high consistency in 
the manufacturing process. Note that in this instance, it is 
equally, if not more, important to control the variability of 
certain product characteristics than it is to control the 
average for a characteristic. For example, with regard to 
the average surface thickness of the polysilicon layer, the 
manufacturing process may be perfectly under control; yet, 
if the variability of the surface thickness on a wafer 
fluctuates widely, the resultant microchips will not be 
reliable. Phadke (1989) describes how different 
characteristics of the manufacturing process (such as deposition 
temperature, deposition pressure, nitrogen flow, etc.) affect 
the variability of the polysilicon surface thickness on wafers. 
However, no theoretical model exists that would allow the 
engineer to predict how these factors affect the uniformness 

In general, the overall constraint, that the three 
components must sum to a constant, is reflected in the 
triangular shape of the graph (see above). 


Example 6.1: Constrained mixture designs. It is 
particularly common in mixture designs that the relative 
amounts of components are further constrained (in addition 
to the constraint that they must sum to, for example, 100%). 
For example, suppose we wanted to design the best-tasting 
fruit punch consisting of a mixture of juices from five fruits. 
Since the resulting mixture is supposed to be a fruit punch, 
pure blends consisting of the pure juice of only one fruit are 
necessarily excluded. Additional constraints may be placed 
on the “universe” of mixtures due to cost constraints or 
other considerations, so that one particular fruit cannot, for 
example, account for more than 30% of the mixtures 
(otherwise the fruit punch would be too expensive, the 
shelf-life would be compromised, the punch could not be produced 
in large enough quantities, etc.). Such so-called constrained 
experimental regions present numerous problems, which, 
however, can be addressed. 

In general, under those conditions, one seeks to design 
an experiment that can potentially extract the maximum 
amount of information about the respective response function 
(e.g., taste of the fruit punch) in the experimental region of 
interest. 
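
A constrained experimental region can be explored by listing the candidate blends that satisfy all constraints. The sketch below is a simplified illustration under stated assumptions: amounts are varied in steps of 10%, and the 30% ceiling is applied to every one of the five juices (in the example above it applies to one particular fruit). The function name and its parameters are hypothetical.

```python
from itertools import product

def candidate_blends(n_components=5, step=10, upper=30):
    """List blends (in percent) that sum to 100 with no component above `upper`."""
    levels = range(0, upper + 1, step)             # 0, 10, 20, 30
    return [blend for blend in product(levels, repeat=n_components)
            if sum(blend) == 100]

blends = candidate_blends()
print(len(blends), "candidate blends, for example", blends[0])
```

With a 30% ceiling no blend can consist of a single juice, so pure blends are excluded automatically. An optimal-design procedure would then choose the runs of the experiment from such a candidate list.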

Computational Problems 

There are basically two general issues to which 
Experimental Design is addressed: 

How to design an optimal experiment, and 

How to analyze the results of an experiment. 

even unconfounded with the interaction of factors. 

Components of Variance, Denominator Synthesis 

There are several statistical methods for analyzing 
designs with random effects. The Variance Components and 
Mixed Model ANOVA/ANCOVA chapter discusses numerous 
options for estimating variance components for random 
effects, and for performing approximate F tests based on 
synthesized error terms. 


Summary 

Experimental methods are finding increasing use in 
manufacturing to optimize the production process. 
Specifically, the goal of these methods is to identify the 
optimum settings for the different factors that affect the 
production process. In the discussion so far, the major classes 
of designs that are typically used in industrial 
experimentation have been introduced: 2**(k-p) (two-level, 
multi-factor) designs, screening designs for large numbers 
of factors, 3**(k-p) (three-level, multi-factor) designs (mixed 
designs with 2 and 3 level factors are also supported), central 
composite (or response surface) designs, Latin square designs, 
Taguchi robust design analysis, mixture designs, and special 
procedures for constructing experiments in constrained 
experimental regions. Interestingly, many of these 
experimental techniques have “made their way” from the 
production plant into management, and successful 
implementations have been reported in profit planning in 
business, cash-flow optimization in banking, etc. (e.g., see 
Yokoyama and Taguchi, 1975). 

These techniques will now be described in greater detail 
in the following sections: 

2**(k-p) Fractional Factorial Designs at 2 Levels 
Basic Idea 

In many cases, it is sufficient to consider the factors 
affecting the production process at two levels. For example, 
the temperature for a chemical process may either be set a 
little higher or a little lower, the amount of solvent in a 
dyestuff manufacturing process can either be slightly 
increased or decreased, etc. The experimenter would like to 
determine whether any of these changes affect the results 
of the production process. The most intuitive approach to 
study those factors would be to vary the factors of interest 
in a full factorial design, that is, to try all possible 
combinations of settings. This would work fine, except that 
the number of necessary runs in the experiment 
(observations) will increase geometrically. For example, if 
you want to study 7 factors, the necessary number of runs 
in the experiment would be 2**7 = 128. To study 10 factors 
you would need 2**10 = 1,024 runs in the experiment. 
Because each run may require time-consuming and costly 
setting and resetting of machinery, it is often not feasible to 
require that many different production runs for the 
experiment. In these conditions, fractional factorials are 
used that “sacrifice” interaction effects so that main effects 
may still be computed correctly. 
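
The growth in the number of runs is easy to verify by enumerating the full factorial. This is a minimal sketch; the function name and the coding of the low and high settings as -1 and +1 are choices made for the example.

```python
from itertools import product

def full_factorial(k):
    """All 2**k combinations of low (-1) and high (+1) settings for k factors."""
    return list(product((-1, +1), repeat=k))

for k in (3, 7, 10):
    print(k, "factors:", len(full_factorial(k)), "runs")
```

Each additional factor doubles the number of runs, which is the practical motivation for the fractional designs discussed next.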

Generating the Design 

A technical description of how fractional factorial designs 
are constructed is beyond the scope of this introduction. 
Detailed accounts of how to design 2**(k-p) experiments 
can be found, for example, in Bayne and Rubin (1986), Box 

Reading the design. The design displayed above should 
be interpreted as follows. Each column contains +1’s or -1’s 
to indicate the setting of the respective factor (high or low, 
respectively). So for example, in the first run of the 
experiment, set all factors A through K to the plus setting 
(e.g., a little higher than before); in the second run, set 
factors A, B, and C to the positive setting, factor D to the 
negative setting, and so on. Note that there are numerous 
options provided to display (and save) the design using 
notation other than ±1 to denote factor settings. For example, 
you may use actual values of factors (e.g., 90 degrees Celsius 
and 100 degrees Celsius) or text labels (Low temperature, 
High temperature). 

Randomizing the runs: Because many other things 
may change from production run to production run, it is 
always a good practice to randomize the order in which the 
systematic runs of the designs are performed. 
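
Randomizing the run order takes one shuffle of the standard run numbers. The sketch below uses a small 2**3 design and a fixed seed purely for illustration; both are assumptions, and the seed is only there so that the schedule can be reproduced.

```python
import random
from itertools import product

design = list(product((-1, 1), repeat=3))      # 8 systematic runs of a 2**3 design
order = list(range(1, len(design) + 1))        # standard run numbers 1..8
random.Random(42).shuffle(order)               # randomized execution order

# Each entry: (position in the schedule, standard run number, factor settings)
schedule = [(position, std_run, design[std_run - 1])
            for position, std_run in enumerate(order, start=1)]
for entry in schedule:
    print(entry)
```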

The Concept of Design Resolution 

The design above is described as a 2**(11-7) design of 
resolution III (three). This means that you study overall k = 
11 factors (the first number in parentheses); however, p = 7 
of those factors (the second number in parentheses) were 
generated from the interactions of a full 2**[(11-7) = 4] factorial 
design. As a result, the design does not give full resolution; 
that is, there are certain interaction effects that are 
confounded with (identical to) other effects. In general, a 
design of resolution R is one where no l-way interactions 
(l being the order of the interaction) are confounded with 
any other interaction of order less than R - l. In the current 
example, R is equal to 3. Here, no l = 1 
level interactions (i.e., main effects) are confounded with 
any other interaction of order less than R - l = 3 - 1 = 2. Thus, 
main effects in this design are confounded with two-way 
interactions; and consequently, all higher-order interactions 
are equally confounded. If you had included 64 runs, and 
generated a 2**(11-5) design, the resultant resolution would 
have been R = IV (four). You would have concluded that no 
l = 1-way interaction (main effect) is confounded with any 
other interaction of order less than R - l = 4 - 1 = 3. In this 
design then, main effects are not confounded with two-way 
interactions, but only with three-way interactions. What 
about the two-way interactions? No l = 2-way interaction is 
confounded with any other interaction of order less than 
R - l = 4 - 2 = 2. Thus, the two-way interactions in that design 
are confounded with each other. 

Plackett-Burman (Hadamard Matrix) Designs for 
Screening 

When one needs to screen a large number of factors to 
identify those that may be important (i.e., those that are 
related to the dependent variable of interest), one would 
like to employ a design that allows one to test the largest 
number of factor main effects with the least number of 
observations, that is, to construct a resolution III design 
with as few runs as possible. One way to design such 
experiments is to confound all interactions with “new” main 
effects. Such designs are also sometimes called saturated 
designs, because all information in those designs is used to 
estimate the parameters, leaving no degrees of freedom to 
estimate the error term for the ANOVA. Because the added 
factors are created by equating the “new” factors with the 
interactions of a full factorial design, these designs always 
will have 2**k runs (e.g., 4, 8, 16, 32, and so on). Plackett 
and Burman (1946) showed how full factorial designs can be 
fractionalized in a different manner, to yield saturated 
designs where the number of runs is a multiple of 4, rather 
than a power of 2. These designs are also sometimes called 
Hadamard matrix designs. Of course, you do not have to 
use all available factors in those designs, and, in fact, 
sometimes you want to generate a saturated design for one 
more factor than you are expecting to test. This will allow 
you to estimate the random error variability, and test for 
the statistical significance of the parameter estimates. 
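
As an illustration of a design whose number of runs is a multiple of 4 but not a power of 2, the following sketch constructs a 12-run design for up to 11 factors. It assumes the commonly tabulated 12-run generator row (+ + - + + + - - - + -), shifts it cyclically to obtain 11 rows, and appends a row of all low settings; the names are hypothetical and the sketch is not tied to any particular software.

```python
# Commonly tabulated generator row for the 12-run design (an assumption here).
GENERATOR_12 = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]

def plackett_burman_12():
    """11 cyclic shifts of the generator row plus a final row of -1's."""
    n = len(GENERATOR_12)
    rows = [GENERATOR_12[-shift:] + GENERATOR_12[:-shift] for shift in range(n)]
    rows.append([-1] * n)
    return rows

pb = plackett_burman_12()
pb_columns = list(zip(*pb))
print(len(pb), "runs,", len(pb_columns), "factors")
```

Each column has six high and six low settings, and any two columns are orthogonal, so up to 11 main effects can be estimated from 12 runs. Assigning fewer than 11 factors leaves columns free for estimating error.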

Enhancing Design Resolution via Foldover 

One way in which a resolution III design can be 
enhanced and turned into a resolution IV design is via 
foldover. Suppose you have a 7-factor design in 8 runs: 

Design: 2**(7-4) design 

[Design table: runs 1 through 8 in rows, factors A through G 
in columns; each entry is a factor setting of +1 or -1.] 


This is a resolution III design, that is, the two-way 
interactions will be confounded with the main effects. You 
can turn this design into a resolution IV design via the 
Foldover (enhance resolution) option. The foldover method 
copies the entire design and appends it to the end, reversing 
all signs: 


Design: 2**(7-4) design (+Foldover) 

[Design table: runs 1 through 16 in rows; columns A through G 
plus the new factor H; each entry is a factor setting of +1 
or -1.] 

Thus, the standard run number 1 was -1, -1, -1, 1, 1, 1, 
-1; the new run number 9 (the first run of the “folded-over” 
portion) has all signs reversed: 1, 1, 1, -1, -1, -1, 1. In 
addition to enhancing the resolution of the design, we also 
have gained an 8th factor (factor H), which contains all 
+1’s for the first eight runs, and -1’s for the folded-over 
portion of the new design. 
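
The foldover operation can be sketched as follows. The generators D = AB, E = AC, F = BC, and G = ABC are an assumption: they are one standard choice for a 7-factor, 8-run design and they reproduce the run number 1 quoted above, but the text does not state which generators were used.

```python
from itertools import product

def base_design():
    """2**(7-4) design; assumed generators D=AB, E=AC, F=BC, G=ABC."""
    rows = []
    for c, b, a in product((-1, 1), repeat=3):      # factor A varies fastest
        rows.append([a, b, c, a * b, a * c, b * c, a * b * c])
    return rows

def foldover(design):
    """Append the sign-reversed design; new factor H is +1, then -1."""
    original = [row + [1] for row in design]
    mirrored = [[-x for x in row] + [-1] for row in design]
    return original + mirrored

def aliased(design, main, i, j):
    """True if the main-effect column equals (up to sign) the i-by-j interaction."""
    dot = sum(row[main] * row[i] * row[j] for row in design)
    return abs(dot) == len(design)

base = base_design()
folded = foldover(base)
print(base[0], folded[8])
print(aliased(base, 3, 0, 1))     # D is confounded with AB before the foldover
print(aliased(folded, 3, 0, 1))   # and no longer confounded afterwards
```

Reversing all signs flips every main-effect column but leaves every two-way interaction column unchanged, which is why the two can be separated in the combined 16 runs.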

Aliases of Interactions: Design Generators 

To return to the example of the resolution R = III design, 
now that you know that main effects are confounded with 
two-way interactions, you may ask the question, “Which 
interaction is confounded with which main effect?” 


Fractional Design Generators 
2**(11-7) design 
(Factors are denoted by numbers) 

Factor    Alias 
5         123 
6         234 
7         134 
8         124 
9         1234 
10        12 
11        13 


Design generators: The design generators shown above 
are the “key” to how factors 5 through 11 were generated by 
assigning them to particular interactions of the first 4 factors 
of the full factorial 2**4 design. Specifically, factor 5 is 
identical to the 123 (factor 1 by factor 2 by factor 3) 
interaction. Factor 6 is identical to the 234 interaction, 
and so on. Remember that the design is of resolution III 
(three), and you expect some main effects to be confounded 
with some two-way interactions; indeed, factor 10 (ten) is 
identical to the 12 (factor 1 by factor 2) interaction, and 
factor 11 (eleven) is identical to the 13 (factor 1 by factor 3) 
interaction. Another way in which these equivalencies are 
often expressed is by saying that the main effect for factor 
10 (ten) is an alias for the interaction of 1 by 2. 

To summarize, whenever you want to include fewer 
observations (runs) in your experiment than would be 
required by the full factorial 2**k design, you “sacrifice” 
interaction effects and assign them to the levels of factors. 
The resulting design is no longer a full factorial but a 
fractional factorial. 
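
The way the generators turn a full 2**4 factorial into an 11-factor design can be shown directly. The sketch below uses the generator assignments listed in this section (5 = 123, 6 = 234, 7 = 134, 8 = 124, 9 = 1234, 10 = 12, 11 = 13); the function name and the run order are choices made for the example.

```python
from itertools import product
from math import prod

# Generated factor -> the base factors whose interaction defines it
GENERATORS = {5: (1, 2, 3), 6: (2, 3, 4), 7: (1, 3, 4), 8: (1, 2, 4),
              9: (1, 2, 3, 4), 10: (1, 2), 11: (1, 3)}

def fractional_design(n_base, generators):
    """Full factorial in the base factors, with generated columns appended."""
    runs = []
    for levels in product((-1, 1), repeat=n_base):
        setting = dict(enumerate(levels, start=1))       # factors 1..n_base
        for factor, parents in generators.items():
            setting[factor] = prod(setting[p] for p in parents)
        runs.append([setting[f] for f in sorted(setting)])
    return runs

design_11_7 = fractional_design(4, GENERATORS)
print(len(design_11_7), "runs,", len(design_11_7[0]), "factors")
# Factor 10 (index 9) equals the product of factors 1 and 2 in every run:
print(all(row[9] == row[0] * row[1] for row in design_11_7))
```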

The fundamental identity: Another way to summarize 
the design generators is in a simple equation. Namely, if, 
for example, factor 5 in a fractional factorial design is 
identical to the 123 (factor 1 by factor 2 by factor 3) 
interaction, then it follows that multiplying the coded values 
for the 123 interaction by the coded values for factor 5 will 
always result in +1 (if all factor levels are coded ±1); or: 

I = 1235 

where I stands for +1 (using the standard notation as, for 
example, found in Box and Draper, 1987). Thus, we also 
know that factor 1 is confounded with the 235 interaction, 
factor 2 with the 135 interaction, and factor 3 with the 125 
interaction, because, in each instance, their product must be 
equal to 1. The confounding of two-way interactions is also 
defined by this equation, because the 12 interaction 
multiplied by the 35 interaction must yield 1, and hence, 
they are identical or confounded. Therefore, one can 
summarize all confounding in a design with such a 
fundamental identity equation. 
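
The arithmetic of the fundamental identity can be mechanized by treating each effect as a set of factor numbers. Because every coded column squares to +1, multiplying two effects cancels the factors they share, which is a symmetric difference of sets. The sketch below is illustrative; the function names are hypothetical, and the generator words are those of the 11-factor design discussed in this section.

```python
from itertools import combinations

def multiply(*words):
    """Multiply effects coded +1/-1: a factor that appears twice cancels."""
    result = frozenset()
    for word in words:
        result = result ^ frozenset(word)
    return result

def defining_relation(generator_words):
    """All non-empty products of the generator words (each equals I)."""
    words = set()
    for r in range(1, len(generator_words) + 1):
        for combo in combinations(generator_words, r):
            words.add(multiply(*combo))
    return words

identity = {1, 2, 3, 5}                          # I = 1235
alias_of_1 = multiply({1}, identity)             # factor 1 -> 235
alias_of_12 = multiply({1, 2}, identity)         # interaction 12 -> 35
print(sorted(alias_of_1), sorted(alias_of_12))

# Generator words of the 11-factor design: 5 = 123 gives I = 1235, and so on.
generator_words = [{1, 2, 3, 5}, {2, 3, 4, 6}, {1, 3, 4, 7}, {1, 2, 4, 8},
                   {1, 2, 3, 4, 9}, {1, 2, 10}, {1, 3, 11}]
resolution = min(len(word) for word in defining_relation(generator_words))
print("resolution:", resolution)
```

The length of the shortest word in the defining relation gives the resolution of the design, here III, in agreement with the earlier discussion.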

Blocking 

In some production processes, units are produced in 
natural “chunks” or blocks. You want to make sure that 
these blocks do not bias your estimates of main effects. For 
example, you may have a kiln to produce special ceramics, 
but the size of the kiln is limited so that you cannot produce 
all runs of your experiment at once. In that case you need 
to break up the experiment into blocks. However, you do 
not want to run positive settings of all factors in one block, 
and all negative settings in the other. Otherwise, any 
incidental differences between blocks would systematically 
affect all estimates of the main effects of the factors of 
interest. Rather, you want to distribute the runs over the 
blocks so that any differences between blocks (i.e., the 
blocking factor) do not bias your results for the factor effects 
of interest. This is accomplished by treating the blocking 
factor as another factor in the design. Consequently, you 
“lose” another interaction effect to the blocking factor, and 
the resultant design will be of lower resolution. However, 
these designs often have the advantage of being statistically 
more powerful, because they allow you to estimate and 
control the variability in the production process that is due 
to differences between blocks. 
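
A small sketch shows how an interaction can be “given up” to the blocking factor. It assumes a 2**3 design split into two blocks by the sign of the three-way interaction ABC; the choice of design and of the blocking interaction is made only for illustration.

```python
from itertools import product

runs = list(product((-1, 1), repeat=3))        # full 2**3 design, factors A, B, C

# Assign each run to a block according to the sign of the ABC interaction.
blocks = {+1: [], -1: []}
for a, b, c in runs:
    blocks[a * b * c].append((a, b, c))

for label, members in blocks.items():
    print("block", label, members)
```

Within each block every factor is set high twice and low twice, so a constant shift between blocks cancels out of all main effects; only the ABC interaction can no longer be separated from the block difference.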

Replicating the Design 

It is sometimes desirable to replicate the design, that is, 
to run each combination of factor levels in the design more 
than once. This will allow you to later estimate the so-called 
pure error in the experiment. The analysis of 
experiments is further discussed below; however, it should 
be clear that, when replicating the design, one can compute 
the variability of measurements within each unique 
combination of factor levels. This variability will give an 
indication of the random error in the measurements (e.g., 
due to uncontrolled factors, unreliability of the measurement 
instrument, etc.), because the replicated observations are 
taken under identical conditions (settings of factor levels). 
Such an estimate of the pure error can be used to evaluate 
the size and statistical significance of the variability that 
can be attributed to the manipulated factors. 

Partial replications: When it is not possible or feasible 
to replicate each unique combination of factor levels (i.e., 
the full design), one can still gain an estimate of pure error 
by replicating only some of the runs in the experiment. 
However, one must be careful to consider the possible bias 
that may be introduced by selectively replicating only some 
runs. If one only replicates those runs that are most easily 
repeated (e.g., gathers information at the points where it is 
“cheapest”), one may inadvertently choose only those 
combinations of factor levels that happen to produce very 
little (or very much) random variability, causing one to 
underestimate (or overestimate) the true amount of pure 
error. Thus, one should carefully consider, typically based 
on one's knowledge about the process that is being studied, 
which runs should be replicated, that is, which runs will 
yield a good (unbiased) estimate of pure error. 

Adding Center Points 


Designs with factors that are set at two levels implicitly 
assume that the effect of the factors on the dependent 
variable of interest (e.g., fabric Strength) is linear. It is 
impossible to test whether or not there is a non-linear (e.g., 
quadratic) component in the relationship between a factor A 
and a dependent variable, if A is only evaluated at two 
points (i.e., at the low and high settings). If one suspects 
that the relationship between the factors in the design and 
the dependent variable is rather curvilinear, then one 
should include one or more runs where all (continuous) 
factors are set at their midpoint. Such runs are called 
center-point runs (or center points), since they are, in a sense, in 
the center of the design (see graph). 

Later in the analysis (see below), one can compare the 
measurements for the dependent variable at the center 
point with the average for the rest of the design. This 
provides a check for curvature (see Box and Draper, 1987): 
If the mean for the dependent variable at the center of the 
design is significantly different from the overall mean at 
all other points of the design, then one has good reason to 
believe that the simple assumption that the factors are 
linearly related to the dependent variable does not hold. 

Analyzing the Results of a 2**(k-p) Experiment 

Analysis of variance. Next, one needs to determine 
exactly which of the factors significantly affected the 
dependent variable of interest. For example, in the study 
reported by Box and Draper (1987, page 115), it is desired 
to learn which of the factors involved in the manufacture 
of dyestuffs affected the strength of the fabric. In 
this example, factors 1 (Polysulfide), 4 (Time), and 6 
(Temperature) significantly affected the strength of the 
fabric. Note that to simplify matters, only main effects 
are shown below. 

ANOVA; Var.: STRENGTH; R-sqr = .60614; Adj: .56469 (fabrico.sta) 
2**(6-0) design; MS Residual = 3.62509 
DV: STRENGTH 

                     SS   df         MS          F        p 
(1) POLYSUFD    48.8252    1    48.8252   13.46867  .000536 
(2) REFLUX       7.9102    1     7.9102    2.18206  .145132 
(3) MOLES         .1702    1      .1702     .04694  .829252 
(4) TIME       142.5039    1   142.5039   39.31044  .000000 
(5) SOLVENT      2.7639    1     2.7639     .76244  .386230 
(6) TEMPERTR   115.8314    1   115.8314   31.95269  .000001 
Error          206.6302   57     3.6251 
Total SS       524.6348   63 
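
The F ratios and the R-square in this table follow from the printed sums of squares, as the short sketch below illustrates. The numbers are copied from the table (each effect has 1 degree of freedom, so its mean square equals its sum of squares); the variable names are hypothetical, and p values are not recomputed here.

```python
# Mean squares of the six main effects (df = 1 each), as printed in the table
ms_effects = {"POLYSUFD": 48.8252, "REFLUX": 7.9102, "MOLES": 0.1702,
              "TIME": 142.5039, "SOLVENT": 2.7639, "TEMPERTR": 115.8314}
ss_error, df_error = 206.6302, 57
ss_total = 524.6348

ms_error = ss_error / df_error                     # residual mean square
f_ratios = {name: ms / ms_error for name, ms in ms_effects.items()}
r_squared = 1 - ss_error / ss_total                # share of variability explained

for name, f in f_ratios.items():
    print(f"{name:10s} F = {f:.5f}")
print(f"MS Residual = {ms_error:.5f}, R-sqr = {r_squared:.5f}")
```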





Pure error and lack of fit: If the experimental design 
is at least partially replicated, then one can estimate the 
error variability for the experiment from the variability of 
the replicated runs. Since those measurements were taken 
under identical conditions, that is, at identical settings of 
the factor levels, the estimate of the error variability from 
those runs is independent of whether or not the “true” model 
is linear or non-linear in nature, or includes higher-order 
interactions. The error variability so estimated represents 
pure error, that is, it is entirely due to unreliabilities in the 
measurement of the dependent variable. If available, one 
can use the estimate of pure error to test the significance of 
the residual variance, that is, all remaining variability that 
cannot be accounted for by the factors and their interactions 
that are currently in the model. If, in fact, the residual 
variability is significantly larger than the pure error 
variability, then one can conclude that there is still some 
statistically significant variability left that is attributable 
to differences between the groups, and hence, that there is 
an overall lack of fit of the current model. 

ANOVA; Var.: STRENGTH; R-sqr = .58547; Adj: .56475 (fabrico.sta) 
2**(3-0) design; MS Pure Error = 3.594844 
DV: STRENGTH 

                     SS  df        MS         F        p
(1) POLYSUFD    48.8252   1   48.8252  13.58200  .000517
(2) TIME       142.5039   1  142.5039  39.64120  .000000
(3) TEMPERTR   115.8314   1  115.8314  32.22154  .000001
Lack of Fit     16.1631   4    4.0408   1.12405  .354464
Pure Error     201.3113  56    3.5948
Total SS       524.6348  63
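The split of the residual variability into pure error and lack of fit can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of any particular program: the function names are hypothetical, and the small replicated data set is invented. The final call uses the sums of squares from the lack-of-fit table for the fabric data.

```python
from collections import defaultdict


def pure_error(runs):
    """Return (SS, df) of pure error from replicated runs.

    `runs` is a list of (settings, response) pairs; runs that share the
    same settings tuple are treated as replicates of one design point.
    """
    groups = defaultdict(list)
    for settings, y in runs:
        groups[settings].append(y)
    ss, df = 0.0, 0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss += sum((y - mean) ** 2 for y in ys)
        df += len(ys) - 1
    return ss, df


def lack_of_fit_f(ss_residual, df_residual, ss_pe, df_pe):
    """F ratio of the lack-of-fit mean square to the pure-error mean square."""
    ss_lof = ss_residual - ss_pe
    df_lof = df_residual - df_pe
    return (ss_lof / df_lof) / (ss_pe / df_pe)


# Hypothetical data: a 2**2 design with each design point run twice.
runs = [
    ((-1, -1), 10.0), ((-1, -1), 12.0),
    ((+1, -1), 15.0), ((+1, -1), 15.0),
    ((-1, +1), 11.0), ((-1, +1), 13.0),
    ((+1, +1), 20.0), ((+1, +1), 22.0),
]
ss_pe, df_pe = pure_error(runs)
print(ss_pe, df_pe)

# Fabric example: residual SS = lack of fit + pure error = 16.1631 + 201.3113.
f_fabric = lack_of_fit_f(217.4744, 60, 201.3113, 56)
print(round(f_fabric, 5))
```

A small F ratio here means that the residual variability is no larger than would be expected from measurement unreliability alone.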


For example, the table above shows the results for the 
three factors that were previously identified as most 
important in their effect on fabric strength; all other factors 
were ignored in the analysis. As you can see in the row 
with the label Lack of Fit, when the residual variability for 
this model (i.e., after removing the three main effects) is 
compared against the pure error estimated from the 
within-group variability, the resulting F test is not statistically 
significant. Therefore, this result additionally supports the 
conclusion that, indeed, factors Polysulfide, Time, and 
Temperature significantly affected resultant fabric strength 
in an additive manner (i.e., there are no interactions). Or, 
put another way, all differences between the means obtained 
in the different experimental conditions can be sufficiently 
explained by the simple additive model for those three 
variables. 

Parameter or effect estimates. Now, look at how 
these factors affected the strength of the fabrics. 

The numbers below are the effect or parameter estimates. 
With the exception of the overall Mean/Intercept, these 
estimates are the deviations of the mean of the negative 
settings from the mean of the positive settings for the 
respective factor. For example, if you change the setting of 
factor Time from low to high, then you can expect an 
improvement in Strength by 2.98; if you set the value for 
factor Polysulfide to its high setting, you can expect a further 
improvement by 1.75, and so on. 



                  Effect  Std.Err.     t(57)         p
Mean/Interc.    11.12344   .237996  46.73794   .000000
(1) POLYSUFD     1.74688   .475992   3.66997   .000536
(2) REFLUX        .70313   .475992   1.47718   .145132
(3) MOLES         .10313   .475992    .21665   .829252
(4) TIME         2.98438   .475992   6.26980   .000000
(5) SOLVENT      -.41562   .475992   -.87318   .386230
(6) TEMPERTR     2.69062   .475992   5.65267   .000001


As you can see, the same three factors that were 
statistically significant show the largest parameter 
estimates; thus the settings of these three factors were most 
important for the resultant strength of the fabric. 

For analyses including interactions, the interpretation 
of the effect parameters is a bit more complicated. 
Specifically, the two-way interaction parameters are defined 
as half the difference between the main effects of one factor 
at the two levels of a second factor (see Mason, Gunst, and 
Hess, 1989, page 127); likewise, the three-way interaction 
parameters are defined as half the difference between the 
two-factor interaction effects at the two levels of a third 
factor, and so on. 
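These definitions can be checked on a tiny example. The sketch below assumes a hypothetical 2**2 experiment with invented responses; the helper names are illustrative. A main effect is computed as the mean at the high setting minus the mean at the low setting, and the two-way interaction as half the difference between the effect of A at the two levels of B.

```python
# Hypothetical 2**2 experiment: (A, B, response), levels coded -1 / +1.
runs = [
    (-1, -1, 10.0),
    (+1, -1, 14.0),
    (-1, +1, 12.0),
    (+1, +1, 20.0),
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def main_effect(runs, col, where=lambda run: True):
    """Mean response at the high setting minus mean at the low setting."""
    high = mean(r[2] for r in runs if r[col] == +1 and where(r))
    low = mean(r[2] for r in runs if r[col] == -1 and where(r))
    return high - low


effect_a = main_effect(runs, 0)
effect_b = main_effect(runs, 1)

# Effect of A evaluated separately at the high and the low level of B.
a_at_b_high = main_effect(runs, 0, where=lambda r: r[1] == +1)
a_at_b_low = main_effect(runs, 0, where=lambda r: r[1] == -1)
interaction_ab = (a_at_b_high - a_at_b_low) / 2

print(effect_a, effect_b, interaction_ab)
```

In these invented data, factor A raises the response by 8 when B is high but only by 4 when B is low, so the interaction parameter is half of that difference.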

Regression coefficients: One can also look at the 
parameters in the multiple regression model (see Multiple 
Regression). To continue this example, consider the following 
prediction equation: 

Strength = const + b1*x1 + ... + b6*x6 

Here x1 through x6 stand for the 6 factors in the analysis. 
The Effect Estimates results shown earlier also contain these 
parameter estimates: 

                  Coeff.  Std.Err.     t(57)         p
Mean/Interc.     11.1234   .237996  46.73794   .000000
(1) POLYSUFD       .8734   .237996   3.66997   .000536
(2) REFLUX         .3516   .237996   1.47718   .145132
(3) MOLES          .0516   .237996    .21665   .829252
(4) TIME          1.4922   .237996   6.26980   .000000
(5) SOLVENT       -.2078   .237996   -.87318   .386230
(6) TEMPERTR      1.3453   .237996   5.65267   .000001


Actually, these parameters contain little “new” information, 
as they simply are one-half of the parameter values 
(except for the Mean/Intercept) shown earlier. This makes 
sense, since now the coefficient can be interpreted as the 
deviation of the high-setting for the respective factors from 
the center. However, note that this is only the case if the 
factor values (i.e., their levels) are coded as -1 and +1, 
respectively. Otherwise, the scaling of the factor values will 
affect the magnitude of the parameter estimates. In the 
example data reported by Box and Draper (1987, page 115), 
the settings or values for the different factors were recorded 
on very different scales. 

Because the metric for the different factors is no longer 
compatible, the magnitudes of the regression coefficients 
are not compatible either. This is why it is usually more 
informative to look at the ANOVA parameter estimates (for 
the coded values of the factor levels), as shown before. 
However, the regression coefficients can be useful when 
one wants to make predictions for the dependent variable, 
based on the original metric of the factors. 
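The relation between effect estimates and regression coefficients can be illustrated with a short Python sketch. It assumes the -1/+1 coding discussed above and uses the effect estimates from the fabric example (the Time effect is taken as 2.98438, in line with the 2.98 quoted in the text). The function names and the 100 to 120 range passed to `to_coded` are hypothetical.

```python
# ANOVA effect estimates for the coded (-1/+1) factors, fabric example.
intercept = 11.12344
effects = {
    "POLYSUFD": 1.74688,
    "REFLUX": 0.70313,
    "MOLES": 0.10313,
    "TIME": 2.98438,
    "SOLVENT": -0.41562,
    "TEMPERTR": 2.69062,
}

# With -1/+1 coding, each regression coefficient is half the effect.
coefficients = {name: effect / 2 for name, effect in effects.items()}


def predict(settings):
    """Predicted strength for coded settings; unlisted factors stay at 0."""
    return intercept + sum(
        coefficients[name] * x for name, x in settings.items()
    )


def to_coded(value, low, high):
    """Map a value in original units onto the -1..+1 coded scale."""
    center = (low + high) / 2
    half_range = (high - low) / 2
    return (value - center) / half_range


# Prediction with the three significant factors at their high settings.
best = predict({"POLYSUFD": +1, "TIME": +1, "TEMPERTR": +1})
print(round(best, 5))
print(to_coded(120, 100, 120))
```

Recoding every factor to the same -1 to +1 scale is what makes the coefficients comparable in magnitude across factors.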

Graph Options 

Diagnostic plots of residuals. To start with, before 
accepting a particular “model” that includes a particular 
number of effects (e.g., main effects for Polysulfide, Time, 
and Temperature in the current example), one should always 
examine the distribution of the residual values. These are 
computed as the difference between the predicted values (as 
predicted by the current model) and the observed values. 
You can compute the histogram for these residual values, as 
well as probability plots (as shown below). 



The parameter estimates and ANOVA table are based 
on the assumption that the residuals are normally 
distributed. The histogram provides one way to check 
(visually) whether this assumption holds. The so-called 
normal probability plot is another common tool to assess 
how closely a set of observed values (residuals in this case) 
follows a theoretical distribution. In this plot the actual 
residual values are plotted along the horizontal X-axis; the 
vertical Y-axis shows the expected normal values for the 
respective values, after they were rank-ordered. If all values 
fall onto a straight line, then one can be satisfied that the 
residuals follow the normal distribution. 
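The coordinates of a normal probability plot can be computed with the Python standard library. This is a minimal sketch: the residuals are invented, and the plotting position (Blom's formula) is one common convention that is assumed here, since the text does not specify one.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Pair each rank-ordered residual with its expected normal value."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for rank, value in enumerate(ordered, start=1):
        # Blom's plotting position; other conventions differ slightly.
        probability = (rank - 0.375) / (n + 0.25)
        points.append((value, std_normal.inv_cdf(probability)))
    return points


# Hypothetical residual values.
residuals = [0.8, -1.2, 0.1, 2.3, -0.4, -2.0, 0.6, -0.2]
points = normal_plot_points(residuals)
for x, z in points:
    print(f"{x:6.2f}  {z:6.3f}")
```

Plotting the first value of each pair on the X-axis against the second on the Y-axis gives the plot described above; marked departures from a straight line suggest non-normal residuals.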

Pareto chart of effects: The Pareto chart of effects is 
often an effective tool for communicating the results of an 
experiment, in particular to laymen. 






In this graph, the ANOVA effect estimates are sorted 
from the largest absolute value to the smallest absolute 
value. The magnitude of each effect is represented by a 
column, and often, a line going across the columns indicates 
how large an effect has to be (i.e., how long a column must 
be) to be statistically significant. 
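The ordering behind a Pareto chart of effects is easy to reproduce. The sketch below uses the effect estimates and standard error from the fabric example and prints a crude text version of the chart. The critical t value of 2.0025 is an assumed approximation of the two-sided 5% point for 57 degrees of freedom.

```python
# Effect estimates and their common standard error, fabric example.
effects = {
    "POLYSUFD": 1.74688,
    "REFLUX": 0.70313,
    "MOLES": 0.10313,
    "TIME": 2.98438,
    "SOLVENT": -0.41562,
    "TEMPERTR": 2.69062,
}
std_err = 0.475992
t_critical = 2.0025  # assumed: approximate two-sided 5% point of t(57)

# An effect larger than this in absolute value is significant at 5%.
threshold = t_critical * std_err

# Sort from the largest to the smallest absolute effect.
ranked = sorted(effects.items(), key=lambda item: abs(item[1]), reverse=True)
significant = [name for name, effect in ranked if abs(effect) > threshold]

for name, effect in ranked:
    bar = "#" * round(abs(effect) * 10)
    flag = "significant" if name in significant else ""
    print(f"{name:10s} {effect:8.4f} {bar} {flag}")
```

The threshold plays the role of the line drawn across the columns of the chart.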


Normal probability plot of effects. Another useful, albeit 
more technical summary graph, is the normal probability 
plot of the estimates. As in the normal probability plot of 
the residuals, first the effect estimates are rank ordered, 
and then a normal z score is computed based on the 
assumption that the estimates are normally distributed. This 
z score is plotted on the Y-axis; the observed estimates are 
plotted on the X-axis (as shown below). 

Square and cube plots. These plots are often used to 
summarize predicted values for the dependent variable, 
given the respective high and low setting of the factors. The 
square plot (see below) will show the predicted values (and, 
optionally, their confidence intervals) for two factors at a 
time. The cube plot will show the predicted values (and, 
optionally, confidence intervals) for three factors at a time. 









Interaction plots. A general graph for showing the means 
is the standard interaction plot, where the means are 
indicated by points connected by lines. This plot (see below) 
is particularly useful when there are significant interaction 
effects in the model. 

Surface and contour plots. When the factors in the design 
are continuous in nature, it is useful to look at 
surface and contour plots of the dependent variable as a 
function of the factors. 

These types of plots will further be discussed later in 
this section, in the context of 3**(k-p), and central composite 
and response surface designs. 


Summary 

2**(k-p) designs are the “workhorse” of industrial 
experiments. The impact of a large number of factors on the 
production process can simultaneously be assessed with 
relative efficiency (i.e., with few experimental runs). The 
logic of these types of experiments is straightforward (each 
factor has only two settings). 

Disadvantages. The simplicity 
of these designs is also their major flaw. As mentioned 
before, underlying the use of two-level factors is the belief 
that the resultant changes in the dependent variable (e.g., 
fabric strength) are basically linear in nature. This is often 
not the case, and many variables are related to quality 
characteristics in a non-linear fashion. In the example above, 
if you were to continuously increase the temperature factor 
(which was significantly related to fabric strength), you 
would of course eventually hit a “peak,” and from there on 
the fabric strength would decrease as the temperature 
increases. While this type of curvature in the relationship 
between the factors in the design and the dependent variable 
can be detected if the design includes center point runs, 
one cannot fit explicit nonlinear (e.g., quadratic) models 
with 2**(k-p) designs. 


Another problem of fractional designs is the implicit 
assumption that higher-order interactions do not matter; 
but sometimes they do, for example, when some other factors 
are set to a particular level, temperature may be negatively 
related to fabric strength. Again, in fractional factorial 
designs, higher-order interactions (greater than two-way) 
particularly will escape detection. 

2**(k-p) Maximally Unconfounded and Minimum 
Aberration Designs 

Basic Idea 

2**(k-p) fractional factorial designs are often used in 
industrial experimentation because of the economy of data 
collection that they provide. For example, suppose an 
engineer needed to investigate the effects of varying 11 
factors, each with 2 levels, on a manufacturing process. Let 
us call the number of factors k, which would be 11 for this 
example. An experiment using a full factorial design, where 
the effects of every combination of levels of each factor are 
studied, would require 2**(k) experimental runs, or 2048 
runs for this example. To minimize the data collection effort, 
the engineer might decide to forego investigation of 
higher-order interaction effects of the 11 factors, and focus instead 
on identifying the main effects of the 11 factors and any 
low-order interaction effects that could be estimated from 
an experiment using a smaller, more reasonable number of 
experimental runs. There is another, more theoretical reason 
for not conducting huge, full factorial 2 level experiments. 
In general, it is not logical to be concerned with identifying 
higher-order interaction effects of the experimental factors, 
while ignoring lower-order nonlinear effects, such as 
quadratic or cubic effects, which cannot be estimated if only 
2 levels of each factor are employed. So although practical 
considerations often lead to the need to design experiments 
with a reasonably small number of experimental runs, there 
is a logical justification for such experiments. 

The alternative to the 2**(k) full factorial design is the 
2**(k-p) fractional factorial design, which requires only a 
“fraction” of the data collection effort required for full 
factorial designs. For our example with k=11 factors, if only 
64 experimental runs can be conducted, a 2**(11-5) fractional 
factorial experiment would be designed with 2**6 = 64 
experimental runs. In essence, a k-p = 6 way full factorial 
experiment is designed, with the levels of the p factors being 
“generated” by the levels of selected higher order interactions 
of the other 6 factors. Fractional factorials “sacrifice” higher 
order interaction effects so that lower order effects may still 
be computed correctly. However, different criteria can be 
used in choosing the higher order interactions to be used as 
generators, with different criteria sometimes leading to 
different “best” designs. 
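The construction just described can be sketched in Python. The function name and its arguments are illustrative assumptions; the generators E = ABC and F = BCD are used only as an example of a 2**(6-2) design, with the levels of each added factor computed as the product of the levels of the base factors named in its generator.

```python
from itertools import product
from math import prod


def fractional_factorial(base_factors, generators):
    """Build a 2**(k-p) design.

    base_factors: names of the k-p factors that form the full factorial.
    generators:   mapping of each added factor to the factors whose
                  product defines its level, e.g. {"E": "ABC"}.
    Returns a list of dicts mapping factor name to -1 or +1.
    """
    runs = []
    # Levels are listed high-first so the first run has every factor at +1.
    for levels in product((+1, -1), repeat=len(base_factors)):
        run = dict(zip(base_factors, levels))
        for new_factor, word in generators.items():
            run[new_factor] = prod(run[f] for f in word)
        runs.append(run)
    return runs


# A 2**(6-2) design: full factorial in A..D, with E = ABC and F = BCD.
design = fractional_factorial("ABCD", {"E": "ABC", "F": "BCD"})
for number, run in enumerate(design, start=1):
    print(number, [run[f] for f in "ABCDEF"])

print(2 ** 11, 2 ** (11 - 5))  # full factorial vs. fraction for 11 factors
```

Because each generated column is an exact product of base columns, the corresponding interaction can no longer be estimated separately from the new factor.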

2**(k-p) fractional factorial designs can also include 
blocking factors. In some production processes, units are 
produced in natural “chunks” or blocks. To make sure that 
these blocks do not bias your estimates of the effects for the 
k factors, blocking factors can be added as additional factors 
in the design. Consequently, you may “sacrifice” additional 
interaction effects to generate the blocking factors, but these 
designs often have the advantage of being statistically more 
powerful, because they allow you to estimate and control 
the variability in the production process that is due to 
differences between blocks. 

Design Criteria 

Many of the concepts discussed in this overview are 
also addressed in the Overview of 2**(k-p) fractional 
factorial designs. A technical description of how 
fractional factorial designs are constructed is beyond the 
scope of either introductory overview. Detailed accounts of 
how to design 2**(k-p) experiments can be found, for 
example, in Bayne and Rubin (1986), Box and Draper (1987), 
Box, Hunter, and Hunter (1978), Montgomery (1991), Daniel 
(1976), Deming and Morgan (1993), Mason, Gunst, and Hess 
(1989), or Ryan (1989), to name only a few of the many text 
books on this subject. 


In general, the 2**(k-p) maximally unconfounded and 
minimum aberration design techniques will successively 
select which higher-order interactions to use as generators 
for the p factors. For example, consider the following design 
that includes 11 factors but requires only 16 runs 
(observations). 


Design: 2**(11-7), Resolution III 

Run   A   B   C   D   E   F   G   H   I   J   K
  1   1   1   1   1   1   1   1   1   1   1   1
  2   1   1   1  -1   1  -1  -1  -1  -1   1   1
  3   1   1   1   1   1   1   1   1   1   1   1
  4   1   1   1   1   1   1   1   1   1   1   1
  5   1   1   1   1   1   1   1   1   1   1   1
  6   1   1   1   1   1   1   1   1   1   1   1
  7   1   1   1   1   1   1   1   1   1   1   1
  8   1   1   1   1   1   1   1   1   1   1   1
  9   1   1   1   1   1   1   1   1   1   1   1
 10   1   1   1   1   1   1   1   1   1   1   1
 11   1   1   1   1   1   1   1   1   1   1   1
 12   1   1   1   1   1   1   1   1   1   1   1
 13   1   1   1   1   1   1   1   1   1   1   1
 14   1   1   1   1   1   1   1   1   1   1   1
 15   1   1   1   1   1   1   1   1   1   1   1
 16   1   1   1   1   1   1   1   1   1   1   1

Interpreting the design. The design displayed in the 
Scrollsheet above should be interpreted as follows. Each 
column contains +1's or -1's to indicate the setting of the 
respective factor (high or low, respectively). So, for example, 
in the first run of the experiment, all factors A through K 
are set to the higher level, and in the second run, factors A, 
B, and C are set to the higher level, but factor D is set to 
the lower level, and so on. Notice that the settings for each 
experimental run for factor E can be produced by multiplying 
the respective settings for factors A, B, and C. The A x B x 
C interaction effect therefore cannot be estimated 
independently of the factor E effect in this design because 
these two effects are confounded. Likewise, the settings for 
factor F can be produced by multiplying the respective 
settings for factors B, C, and D. We say that ABC and BCD 
are the generators for factors E and F, respectively. 


The maximum resolution design criterion: In the 
Scrollsheet shown above, the design is described as a 2**(11-7) 
design of resolution III (three). This means that you study 
overall k = 11 factors, but p = 7 of those factors were 
generated from the interactions of a full 2**[(11-7) = 4] factorial 
design. As a result, the design does not give full resolution; 
that is, there are certain interaction effects that are 
confounded with (identical to) other effects. In general, a 
design of resolution R is one where no l-way interactions 
are confounded with any other interaction of order less than 
R - l. In the current example, R is equal to 3. Here, no l = 1-way 
interactions (i.e., main effects) are confounded with 
any other interaction of order less than R - l = 3 - 1 = 2. 
Thus, main effects in this design are unconfounded with 
each other, but are confounded with two-factor interactions; 
and consequently, with other higher-order interactions. One 
obvious, but nevertheless very important overall design 
criterion is that the higher-order interactions to be used as 
generators should be chosen such that the resolution of the 
design is as high as possible. 

The maximum unconfounding design criterion: 
Maximizing the resolution of a design, however, does not by 
itself ensure that the selected generators produce the “best” 
design. Consider, for example, two different resolution IV 
designs. In both designs, main effects would be unconfounded 
with each other and 2-factor interactions would be 
unconfounded with main effects, i.e., no l = 2-way interactions 
are confounded with any other interaction of order less than 
R - l = 4 - 2 = 2. The two designs might be different, however, 
with regard to the degree of confounding for the 2-factor 
interactions. For resolution IV designs, the “crucial order,” 
in which confounding of effects first appears, is for 2-factor 
interactions. In one design, none of the “crucial order,” 
2-factor interactions might be unconfounded with all other 
2-factor interactions, while in the other design, virtually all 
of the 2-factor interactions might be unconfounded with all 
of the other 2-factor interactions. The second “almost 
resolution V” design would be preferable to the first “just 
barely resolution IV” design. This suggests that even though 
the maximum resolution design criterion should be the 
primary criterion, a subsidiary criterion might be that 
generators should be chosen such that the maximum number 
of interactions of less than or equal to the crucial order, 
given the resolution, are unconfounded with all other 
interactions of the crucial order. This is called the maximum 
unconfounding design criterion, and is one of the optional, 
subsidiary design criteria to use in a search for a 2**(k-p) 
design. 
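Both criteria start from the set of effects that are confounded with the overall mean, usually called the defining relation. As a minimal sketch (the function names are illustrative, and the equivalence used here, resolution equals the length of the shortest word in the defining relation, is the standard textbook characterization rather than something stated above), the resolution can be computed from the generators as follows.

```python
from itertools import combinations


def defining_relation(generator_words):
    """All words of the defining relation implied by the generator words.

    Each word is a string of factor letters, e.g. "ABCE" for E = ABC.
    Multiplying two words cancels repeated letters, which corresponds to
    the symmetric difference of their letter sets.
    """
    words = set()
    for size in range(1, len(generator_words) + 1):
        for subset in combinations(generator_words, size):
            letters = frozenset()
            for word in subset:
                letters = letters ^ frozenset(word)
            words.add("".join(sorted(letters)))
    return words


def resolution(generator_words):
    """Resolution as the length of the shortest word in the relation."""
    return min(len(word) for word in defining_relation(generator_words))


# E = ABC and F = BCD give the generator words ABCE and BCDF.
words = defining_relation(["ABCE", "BCDF"])
print(sorted(words), resolution(["ABCE", "BCDF"]))

# A generator built from a two-factor interaction (E = AB) lowers the
# resolution to III.
print(resolution(["ABE", "ACDF"]))
```

Choosing generators that maximize the value returned by `resolution` corresponds to the maximum resolution criterion; the subsidiary criteria then distinguish between designs that tie on it.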

The minimum aberration design criterion: The 
minimum aberration design criterion is another optional, 
subsidiary criterion to use in a search for a 2**(k-p) design. 
In some respects, this criterion is similar to the maximum 
unconfounding design criterion. Technically, the minimum 
aberration design is defined as the design of maximum 
resolution “which minimizes the number of words in the 
defining relation that are of minimum length” (Fries & 
Hunter, 1984). Less technically, the criterion apparently 
operates by choosing generators that produce the smallest 
number of pairs of confounded interactions of the crucial 
order. For example, the minimum aberration resolution IV 
design would have the minimum number of pairs of 
confounded 2-factor interactions. 


To illustrate the difference between the maximum 
unconfounding and minimum aberration criteria, consider 
the maximally unconfounded 2**(9-4) design and the minimum 
aberration 2**(9-4) design, as for example, listed in Box, 
Hunter, and Hunter (1978). If you compare these two 
designs, you will find that in the maximally unconfounded 
design, 15 of the 36 2-factor interactions are unconfounded 
with any other 2-factor interactions, while in the minimum 
aberration design, only 8 of the 36 2-factor interactions are 
unconfounded with any other 2-factor interactions. The 
minimum aberration design, however, produces 18 pairs of 
confounded interactions, while the maximally unconfounded 
design produces 21 pairs of confounded interactions. So, the 
two criteria lead to the selection of generators producing 
different “best” designs. 
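The two quantities compared above, the number of unconfounded 2-factor interactions and the number of pairs of confounded 2-factor interactions, can be counted directly from the generators. The sketch below is an illustration only: the function names are hypothetical, and the two 2**(7-2) resolution IV designs are made-up examples, not the 2**(9-4) designs from the literature. In this small example the two criteria happen to prefer the same design.

```python
from itertools import combinations


def defining_words(generator_words):
    """Letter sets of every word in the defining relation."""
    words = set()
    for size in range(1, len(generator_words) + 1):
        for subset in combinations(generator_words, size):
            letters = frozenset()
            for word in subset:
                letters ^= frozenset(word)
            words.add(letters)
    return words


def two_factor_confounding(factors, generator_words):
    """Return (unconfounded 2-factor interactions, confounded pairs)."""
    words = defining_words(generator_words)
    interactions = [frozenset(pair) for pair in combinations(factors, 2)]
    pairs = [
        (a, b)
        for a, b in combinations(interactions, 2)
        if (a ^ b) in words  # confounded when their product is a word
    ]
    confounded = {i for pair in pairs for i in pair}
    clear = [i for i in interactions if i not in confounded]
    return clear, pairs


# Two hypothetical 2**(7-2) designs of resolution IV.
clear_1, pairs_1 = two_factor_confounding("ABCDEFG", ["DEFG", "ABCDF"])
clear_2, pairs_2 = two_factor_confounding("ABCDEFG", ["ABCF", "ADEG"])
print(len(clear_1), len(pairs_1))
print(len(clear_2), len(pairs_2))
```

The first count is what the maximum unconfounding criterion tries to make large; the second is what the minimum aberration criterion, in the less technical description above, tries to make small.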


Fortunately, the choice of whether to use the maximum 
unconfounding criterion or the minimum aberration criterion 
makes no difference in the design which is selected (except 
for, perhaps, relabeling of the factors) when there are 11 or 
fewer factors, with the single exception of the 2**(9-4) design 
described above (see Chen, Sun, & Wu, 1993). For designs 
with more than 11 factors, the two criteria can lead to the 
selection of very different designs, and for lack of better 
advice, we suggest using both criteria, comparing the designs 
that are produced, and choosing the design that best suits 
your needs. We will add, editorially, that maximizing the 
number of totally unconfounded effects often makes more 
sense than minimizing the number of pairs of confounded 
effects. 




Summary 

2**(k-p) fractional factorial designs are probably the most 
frequently used type of design in industrial experimentation. 
Things to consider in designing any 2**(k-p) fractional 
factorial experiment include the number of factors to be 
investigated, the number of experimental runs, and whether 
there will be blocks of experimental runs. Beyond these 
basic considerations, one should also take into account 
whether the number of runs will allow a design of the 
required resolution and degree of confounding for the crucial 
order of interactions, given the resolution. 

3**(k-p), Box-Behnken, and Mixed 2 and 3 Level 
Factorial Designs 

Overview 

In some cases, factors that have more than 2 levels 
have to be examined. For example, if one suspects that the 
effect of the factors on the dependent variable of interest is 
not simply linear, then, as discussed earlier, one needs at 
least 3 levels in order to test for the linear and quadratic 
effects (and interactions) for those factors. Also, sometimes 
some factors may be categorical in nature, with more than 
2 categories. For example, you may have three different 
machines that produce a particular part. 

Designing 3**(k-p) Experiments 

The general mechanism of generating fractional factorial 
designs at 3 levels (3**(k-p) designs) is very similar to that 
described in the context of 2**(k-p) designs. Specifically, 
one starts with a full factorial design, and then uses the 
interactions of the full design to construct “new” factors (or 
blocks) by making their factor levels identical to those for 
the respective interaction terms (i.e., by making the new 
factors aliases of the respective interactions). 

For example, consider the following simple 3**(3-1) 
factorial design: 

3**(3-1) fractional factorial design, 1 block, 9 runs 

Standard Run   A   B   C
      1        0   0   0
      2        0   1   2
      3        0   2   1
      4        1   0   2
      5        1   1   1
      6        1   2   0
      7        2   0   1
      8        2   1   0
      9        2   2   2


As in the case of 2**(k-p) designs, the design is 
constructed by starting with the full factorial design in 
3-1 = 2 factors; those factors are listed in the first two 
columns (factors A and B). Factor C is constructed from the 
interaction AB of the first two factors. Specifically, the 
values for factor C are computed as 

C = 3 - mod3(A+B) 

Here, mod3(x) stands for the so-called modulo-3 operator, 
which will first find the largest number y that is less than or 
equal to x, and that is evenly divisible by 3, and then compute 
the difference (remainder) between number y and x. For example, 
mod3(0) is equal to 0, mod3(1) is equal to 1, mod3(3) is equal 
to 0, mod3(5) is equal to 2 (3 is the largest number that is 
less than or equal to 5, and that is evenly divisible by 3; 
finally, 5-3=2), and so on. When mod3(A+B) is equal to 0, the 
formula yields 3, which corresponds to level 0 of factor C. 

Fundamental identity. If you apply this function to the 
sum of columns A and B shown above, you will obtain the 
third column C. Similar to the case of 2**(k-p) designs (see 
2**(k-p) designs for a discussion of the fundamental identity 
in the context of 2**(k-p) designs), this confounding of 
interactions with “new” main effects can be summarized in 
an expression: 

0 = mod3(A+B+C) 

If you look back at the 3**(3-1) design shown earlier, 
you will see that, indeed, if you add the numbers in the 
three columns they will all sum to either 0, 3, or 6, that is, 
values that are evenly divisible by 3 (and hence: 
mod3(A+B+C)=0). Thus, one could write as a shortcut 
notation ABC=0, in order to summarize the confounding of 
factors in the fractional 3**(k-p) design. 
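The construction of the 3**(3-1) design and its fundamental identity can be reproduced in a few lines of Python. The function name is illustrative. The extra reduction modulo 3 in the formula for C is an assumption made so that a result of 3 becomes level 0, which matches the first run of the design shown earlier.

```python
from itertools import product


def three_level_fraction():
    """3**(3-1) design: full factorial in A and B, C from the AB interaction."""
    runs = []
    for a, b in product(range(3), repeat=2):
        # C = 3 - mod3(A + B), reduced modulo 3 so that 3 becomes level 0.
        c = (3 - (a + b) % 3) % 3
        runs.append((a, b, c))
    return runs


design = three_level_fraction()
for run_number, (a, b, c) in enumerate(design, start=1):
    # Fundamental identity: the three levels always sum to 0, 3, or 6.
    assert (a + b + c) % 3 == 0
    print(run_number, a, b, c)
```

The nine runs produced this way are listed in the same order as in the table of the 3**(3-1) design.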

Some of the designs will have fundamental identities 
that contain the number 2 as a multiplier; e.g., 

0 = mod3(B+C*2+D+E*2+F) 

This notation can be interpreted exactly as before, that 
is, the modulo 3 of the sum B+2*C+D+2*E+F must be 
equal to 0. The next example shows such an identity. 

An Example 3**(4-1) Design in 9 Blocks 

Here is the summary for a 4-factor 3-level fractional 
factorial design in 9 blocks that requires only 27 runs. 

SUMMARY: 3**(4-1) fractional factorial 
Design generators: ABCD 
Block generators: AB,AC2 
Number of factors (independent variables): 4 
Number of runs (cases, experiments): 27 
Number of blocks: 9 

This design will allow you to test for linear and quadratic 
main effects for 4 factors in 27 observations, which can be 
gathered in 9 blocks of 3 observations each. The fundamental 
identity or design generator for the design is ABCD, thus 
the modulo 3 of the sum of the factor levels across the four 
factors is equal to 0. The fundamental identity also 
allows you to determine the confounding of factors and 
interactions in the design (see McLean and Anderson, 1984, 
for details). 

[Table of unconfounded effects (excluding blocks), with a 
column showing whether each is unconfounded if blocks are 
included] 

As you can see, in this 3**(4-1) design the main effects 
are not confounded with each other, even when the 
experiment is run in 9 blocks. 
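A design with these generators can be sketched as follows. The code reads the design generator ABCD as the identity mod3(A+B+C+D) = 0, and the block generators AB and AC2 as the modulo-3 sums A+B and A+2*C, in line with the multiplier notation introduced earlier. That reading, and all variable names, are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

runs = []
for a, b, c in product(range(3), repeat=3):
    # Design generator ABCD: choose D so that mod3(A+B+C+D) = 0.
    d = (-(a + b + c)) % 3
    # Block generators AB and AC2: two modulo-3 sums index the block.
    block = ((a + b) % 3, (a + 2 * c) % 3)
    runs.append({"A": a, "B": b, "C": c, "D": d, "block": block})

# Number of runs that fall into each of the blocks.
block_sizes = Counter(run["block"] for run in runs)
print(len(runs), len(block_sizes), sorted(block_sizes.values()))
```

Under this reading the 27 runs fall into 9 blocks of 3 observations each, as stated in the summary.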


Box-Behnken Designs 

In the case of 2**(k-p) designs, Plackett and Burman 
(1946) developed highly fractionalized designs to screen the 
maximum number of (main) effects in the least number of 
experimental runs. The equivalent in the case of 3**(k-p) 
designs are the so-called Box-Behnken designs (Box and 
Behnken, 1960; see also Box and Draper, 1984). These 
designs do not have simple design generators (they are 
constructed by combining two-level factorial designs with 
incomplete block designs), and have complex confounding of 
interactions. However, the designs are economical and 
therefore particularly useful when it is expensive to perform 
the necessary experimental runs. 

Analyzing the 3**(k-p) Design 

The analysis of these types of designs proceeds basically 
in the same way as was described in the context of 2**(k-p) 
designs. However, for each effect, one can now test for the 
linear effect and the quadratic (non-linear) effect. For 
example, when studying the yield of a chemical process, 
temperature may be related in a non-linear fashion, that is, 
the maximum yield may be attained when the temperature 
is set at the medium level. Thus, non-linearity often occurs 
when a process performs near its optimum. 


ANOVA Parameter Estimates 

To estimate the ANOVA parameters, the factor levels 
for the factors in the analysis are internally recoded so that 
one can test the linear and quadratic components in the 
relationship between the factors and the dependent variable. 
Thus, regardless of the original metric of factor settings 
(e.g., 100 degrees C, 110 degrees C, 120 degrees C) you can 
always recode those values to -1, 0, and +1 to perform the 
computations. The resultant ANOVA parameter estimates 
can be interpreted analogously to the parameter estimates 
for 2**(k-p) designs. 


For example, consider the following ANOVA results: 

Factor                Effect  Std.Err.     t(69)         p
Mean/Interc.        103.6942   .390591  265.4805  0.000000
BLOCKS(1)              .8028  1.360542     .5901   .557055
BLOCKS(2)            -1.2307  1.291511    -.9529   .343952
(1)TEMPERAT (L)       -.3245   .977778    -.3319   .740991
   TEMPERAT (Q)       -.5111   .809946    -.6311   .530091
(2)TIME (L)            .0017   .977778     .0018   .998589
   TIME (Q)            .0045   .809946     .0056   .995541
(3)SPEED (L)        -10.3073   .977778  -10.5415   .000000
   SPEED (Q)         -3.7915   .809946   -4.6812   .000014
1L by 2L              3.9256  1.540235    2.5487   .013041
1L by 2Q               .4384  1.371941     .3195   .750297
1Q by 2L               .4747  1.371941     .3460   .730403
1Q by 2Q             -2.7499   .995575   -2.7621   .007353

Main-effect estimates. By default, the Effect estimate 
for the linear effects (marked by the L next to the factor 
name) can be interpreted as the difference between the 
average response at the low and high settings for the 
respective factors. The estimate for the quadratic (non-linear) 
effect (marked by the Q next to the factor name) can be 
interpreted as the difference between the average response 
at the center (medium) settings and the combined high and 
low settings for the respective factors. 
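The recoding and the two main-effect estimates can be illustrated for a single three-level factor. This is a minimal sketch with invented yields; the function names are hypothetical, and the sign conventions (high minus low for the linear effect, center minus the combined high and low settings for the quadratic effect) are assumptions that may differ between programs.

```python
def recode(value, low, high):
    """Map the low, center, and high settings onto -1, 0, and +1."""
    center = (low + high) / 2
    return (value - center) / ((high - low) / 2)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def linear_and_quadratic(runs, low, high):
    """Linear and quadratic effect of one three-level factor.

    runs: list of (setting in original units, response) pairs.
    """
    coded = [(recode(x, low, high), y) for x, y in runs]
    at_low = mean(y for x, y in coded if x == -1)
    at_center = mean(y for x, y in coded if x == 0)
    at_high = mean(y for x, y in coded if x == +1)
    # Average over all runs at either the high or the low setting.
    outer = mean(y for x, y in coded if x != 0)
    return at_high - at_low, at_center - outer


# Hypothetical yields at 100, 110, and 120 degrees C (two runs each).
runs = [(100, 50.0), (100, 52.0), (110, 60.0), (110, 62.0),
        (120, 54.0), (120, 56.0)]
linear, quadratic = linear_and_quadratic(runs, 100, 120)
print(linear, quadratic)
```

In these invented data the yield peaks at the medium temperature, so the quadratic estimate is larger than the linear one.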


Interaction effect estimates: As in the case of 2**(k-p) 
designs, the linear-by-linear interaction effect can be 
interpreted as half the difference between the linear main 
effect of one factor at the high and low settings of another. 
Analogously, the interactions by the quadratic components 
can be interpreted as half the difference between the 
quadratic main effect of one factor at the respective settings 
of another, that is, either the high or low setting (quadratic 
by linear interaction), or the medium or high and low 
settings combined (quadratic by quadratic interaction). 


In practice, and from the standpoint of “interpretability 
of results,” one would usually try to avoid quadratic 
interactions. For example, a quadratic-by-quadratic A-by-B 
interaction indicates that the setting of B modifies the 
non-linear effect of factor A in a nonlinear fashion. This means 
that there is a fairly complex interaction between factors 
present in the data that will make it difficult to understand 
and optimize the respective process. Sometimes, performing 
nonlinear transformations (e.g., performing a log 
transformation) of the dependent variable values can remedy 
the problem. 


Centered and non-centered polynomials: As 
mentioned above, the interpretation of the effect estimates 
applies only when you use the default parameterization of 
the model. In that case, you would code the quadratic 
factor interactions so that they become maximally 
“untangled” from the linear main effects. 

Graphical Presentation of Results 

The same diagnostic plots (e.g., of residuals) are available 
for 3**(k-p) designs as were described in the context of 
2**(k-p) designs. Thus, before interpreting the final results, 
one should always first look at the distribution of the 
residuals for the final fitted model. The ANOVA assumes 
that the residuals (errors) are normally distributed. 


Plot of means: When an interaction involves categorical 
factors (e.g., type of machine, specific operator of machine, 
and some distinct setting of the machine), then the best 
way to understand interactions is to look at the respective 
interaction plot of means. 


Surface plot: When the factors in an interaction are
continuous in nature, you may want to look at the surface
plot that shows the response surface implied by the fitted
model. Note that this graph also contains the prediction
equation (in terms of the original metric of factors) that
produces the respective response surface.

Designs for Factors at 2 and 3 Levels

You can also generate standard designs with 2- and 3-level
factors. Specifically, you can generate the standard
designs as enumerated by Connor and Young for the US
National Bureau of Standards (see McLean and Anderson,
1984). The technical details of the method used to generate
these designs are beyond the scope of this introduction.
However, in general the technique is, in a sense, a
combination of the procedures described in the context of
2**(k-p) and 3**(k-p) designs. It should be noted, however,
that while all of these designs are very efficient, they are
not necessarily orthogonal with respect to all main effects.
This is not a problem, however, if one uses a general
algorithm for estimating the ANOVA parameters and sums
of squares that does not require orthogonality of the design.

The design and analysis of these experiments proceed
along the same lines as discussed in the context of 2**(k-p)
and 3**(k-p) experiments.


CENTRAL COMPOSITE AND NON-FACTORIAL 
RESPONSE SURFACE DESIGNS 


Overview 


The 2**(k-p) and 3**(k-p) designs all require that the 
levels of the factors are set at, for example, 2 or 3 levels. In 
many instances, such designs are not feasible, because, for 
example, some factor combinations are constrained in some 
way (e.g., factors A and B cannot be set at their high levels 
simultaneously). Also, for reasons related to efficiency, which 
will be discussed shortly, it is often desirable to explore the 
experimental region of interest at particular points that 
cannot be represented by a factorial design. 

The designs (and how to analyze them) discussed in 
this section all pertain to the estimation (fitting) of response 
surfaces, following the general model equation: 


$$y = b_0 + b_1 x_1 + \dots + b_k x_k + b_{12} x_1 x_2 + b_{13} x_1 x_3 + \dots + b_{k-1,k} x_{k-1} x_k + b_{11} x_1^2 + \dots + b_{kk} x_k^2$$

Put into words, one is fitting a model to the observed
values of the dependent variable y that includes (1) main
effects for factors $x_1, \dots, x_k$, (2) their interactions
($x_1 x_2, x_1 x_3, \dots, x_{k-1} x_k$), and (3) their
quadratic components ($x_1^2, \dots, x_k^2$). No assumptions
are made concerning the “levels” of the factors, and you can
analyze any set of continuous values for the factors.

There are some considerations concerning design
efficiency and biases, which have led to standard designs
that are ordinarily used when attempting to fit these
response surfaces, and those standard designs will be
discussed shortly (e.g., see Box, Hunter, and Hunter, 1978;
Box and Draper, 1987; Khuri and Cornell, 1987; Mason,
Gunst, and Hess, 1989; Montgomery, 1991). But, as will be
discussed later in the context of constrained surface designs
and D- and A-optimal designs, these standard designs
sometimes cannot be used for practical reasons. However, the
central composite design analysis options do not make any
assumptions about the structure of your data file, that is,
the number of distinct factor values or their combinations
across the runs of the experiment. Hence, these options
can be used to analyze any type of design, fitting to the data
the general model described above.

Design Considerations 

Orthogonal designs. One desirable characteristic of any
design is that the main effect and interaction estimates of
interest are independent of each other. For example, suppose
you had a two-factor experiment, with both factors at two
levels. Your design consists of four runs:



           A     B
Run 1      1     1
Run 2      1     1
Run 3     -1    -1
Run 4     -1    -1


For the first two runs, both factors A and B are set at
their high levels (+1). In the last two runs, both are set at
their low levels (-1). Suppose you wanted to estimate the
independent contributions of factors A and B to the
prediction of the dependent variable of interest. Clearly this
is a silly design, because there is no way to estimate the A
main effect and the B main effect separately. One can only
estimate one effect, the difference between Runs 1+2 and
Runs 3+4, which represents the combined effect of A and B.

The point here is that, in order to assess the independent
contributions of the two factors, the factor levels in the four
runs must be set so that the “columns” in the design (under
A and B in the illustration above) are independent of each
other. Another way to express this requirement is to say
that the columns of the design matrix (with as many columns
as there are main effect and interaction parameters that
one wants to estimate) should be orthogonal (this term was
first used by Yates, 1933). For example, if the four runs in
the design are arranged as follows:



           A     B
Run 1      1     1
Run 2      1    -1
Run 3     -1     1
Run 4     -1    -1


then the A and B columns are orthogonal. Now you can
estimate the A main effect by comparing the high level for
A within each level of B with the low level for A within
each level of B; the B main effect can be estimated in the
same way. Technically, two columns in a design matrix are
orthogonal if the sum of the products of their elements
within each row is equal to zero. In practice, one often
encounters situations, for example due to loss of some data
in some runs or other constraints, where the columns of the
design matrix are not completely orthogonal. In general,
the rule here is that the more orthogonal the columns are,
the better the design, that is, the more independent
information can be extracted from the design regarding the
respective effects of interest. Therefore, one consideration
for choosing standard central composite designs is to find
designs that are orthogonal or nearly orthogonal.
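
The orthogonality criterion stated above (the sum of the row-wise
products of two columns equals zero) can be checked directly. The
following is a minimal sketch in Python using NumPy; the function
name column_dot and the variable names are illustrative assumptions,
not part of any package discussed in this text.

```python
import numpy as np

def column_dot(design):
    """Sum of row-wise products of the A and B columns (zero means orthogonal)."""
    a, b = design[:, 0], design[:, 1]
    return int(np.sum(a * b))

# Confounded design: A and B always change together.
confounded = np.array([[1, 1], [1, 1], [-1, -1], [-1, -1]])
# Orthogonal 2-by-2 factorial design.
factorial = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])

print(column_dot(confounded))  # 4: the columns are identical
print(column_dot(factorial))   # 0: the columns are orthogonal
```

A nonzero sum for the first design reflects the fact that the A and
B effects cannot be separated there.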

Rotatable designs. The second consideration is related 
to the first requirement, in that it also has to do with how 
best to extract the maximum amount of (unbiased) 
information from the design, or specifically, from the 
experimental region of interest. Without going into details 
(see Box, Hunter, and Hunter, 1978; Box and Draper, 1987, 
see also Deming and Morgan, 1993, Chapter 13), it can be 
shown that the standard error for the prediction of dependent 
variable values is proportional to: 

$$\left(1 + f(x)' (X'X)^{-1} f(x)\right)^{1/2}$$

where f(x) stands for the (coded) factor effects for the
respective model (f(x) is a vector, f(x)' is the transpose of
that vector), and X is the design matrix for the experiment,
that is, the matrix of coded factor effects for all runs;
$(X'X)^{-1}$ is the inverse of the cross-product matrix. Deming
and Morgan (1993) refer to this expression as the normalized
uncertainty; this function is also related to the variance
function as defined by Box and Draper (1987). The amount
of uncertainty in the prediction of dependent variable values
depends on the variability of the design points and their
covariance over the runs. The point here is that, again, one
would like to choose a design that extracts the most
information regarding the dependent variable and leaves
the least amount of uncertainty for the prediction of future
values. It follows that the amount of information (or
normalized information according to Deming and Morgan,
1993) is the inverse of the normalized uncertainty.

For the simple 4-run orthogonal experiment shown
earlier, the information function is equal to

$$I_x = 4/(1 + x_1^2 + x_2^2)$$

where $x_1$ and $x_2$ stand for the factor settings for factors A
and B, respectively (see Box and Draper, 1987).



Inspection of this function in a plot (see above) shows
that it is constant on circles centered at the origin. Thus
any kind of rotation of the original design points will
generate the same amount of information, that is, generate
the same information function. Therefore, the 2-by-2
orthogonal design in 4 runs shown earlier is said to be
rotatable.
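
The constancy on circles can be verified numerically. The sketch
below assumes that the information function is taken as the
reciprocal of f(x)' (X'X)^-1 f(x) for a first-order model with an
intercept, which is the reading that reproduces the expression
4/(1 + x1**2 + x2**2) quoted for the 4-run design; the function and
variable names are illustrative only.

```python
import numpy as np

# 2-by-2 factorial design in coded units (columns A and B).
design = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)

def f(x1, x2):
    """Coded effects for the first-order model: intercept, A, B."""
    return np.array([1.0, x1, x2])

# Design matrix with an intercept column, and the inverse cross-product matrix.
X = np.array([f(a, b) for a, b in design])
XtX_inv = np.linalg.inv(X.T @ X)

def information(x1, x2):
    """Reciprocal of f(x)' (X'X)^-1 f(x) at the point (x1, x2)."""
    v = f(x1, x2)
    return 1.0 / (v @ XtX_inv @ v)

# Evaluate at several points on a circle of radius 1.5 around the origin.
radius = 1.5
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
values = [information(radius * np.cos(t), radius * np.sin(t)) for t in angles]

# Every value on the circle should equal 4 / (1 + radius**2).
print(np.allclose(values, 4.0 / (1.0 + radius**2)))
```

Because the value depends only on the distance from the origin,
rotating the evaluation points leaves it unchanged.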


As pointed out before, in order to estimate the
second-order, quadratic, or nonlinear component of the
relationship between a factor and the dependent variable, one
needs at least 3 levels for the respective factors. What does
the information function look like for a simple 3-by-3
factorial design, for the second-order quadratic model as
shown at the beginning of this section?

As it turns out, this function looks more complex,
contains “pockets” of high-density information at the edges
(which are probably of little particular interest to the
experimenter), and clearly is not constant on circles around
the origin. Therefore, it is not rotatable, meaning that
different rotations of the design points will extract
different amounts of information from the experimental region.

Star-points and rotatable second-order designs: It
can be shown that by adding so-called star-points to the
simple (square or cube) 2-level factorial design points, one
can achieve rotatable, and often orthogonal or nearly
orthogonal, designs. For example, adding the following points
to the simple 2-by-2 orthogonal design shown earlier will
produce a rotatable design.

The first four runs in this design are the previous 2-by-2
factorial design points (or square points or cube points);
runs 5 through 8 are the so-called star points or axial points,
and runs 9 and 10 are center points.



            A        B
Run 1       1        1
Run 2       1       -1
Run 3      -1        1
Run 4      -1       -1
Run 5      -1.414    0
Run 6       1.414    0
Run 7       0       -1.414
Run 8       0        1.414
Run 9       0        0
Run 10      0        0


The information function for this design for the
second-order (quadratic) model is rotatable, that is, it is
constant on the circles around the origin.

Alpha for Rotatability and Orthogonality

The two design characteristics discussed so far,
orthogonality and rotatability, depend on the number of
center points in the design and on the so-called axial
distance α (alpha), which is the distance of the star points
from the center of the design (i.e., 1.414 in the design shown
above). It can be shown that a design is rotatable if:

$$\alpha = (n_c)^{1/4}$$

where $n_c$ stands for the number of cube points in the design
(i.e., points in the factorial portion of the design).

A central composite design is orthogonal if one chooses
the axial distance so that:

$$\alpha = \left\{\left[(n_c + n_s + n_0)^{1/2} - n_c^{1/2}\right]^2 n_c / 4\right\}^{1/4}$$

where

$n_c$ is the number of cube points in the design
$n_s$ is the number of star points in the design
$n_0$ is the number of center points in the design

To make a design both (approximately) orthogonal and
rotatable, one would first choose the axial distance for
rotatability, and then add center points (see Khuri and
Cornell, 1987), so that:

$$n_0 \approx 4 n_c^{1/2} + 4 - 2k$$

where k stands for the number of factors in the design.

Finally, if blocking is involved, Box and Draper (1987)
give the following formula for computing the axial distance
to achieve orthogonal blocking, and in most cases also
reasonable information function contours, that is, contours
that are close to spherical:

$$\alpha = \left[k (1 + n_{s0}/n_s) / (1 + n_{c0}/n_c)\right]^{1/2}$$

where

$n_{s0}$ is the number of center points in the star portion of
the design

$n_s$ is the number of non-center star points in the design

$n_{c0}$ is the number of center points in the cube portion of
the design

$n_c$ is the number of non-center cube points in the design
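
The axial distance formulas of this section can be collected in a
few small functions. This is a minimal sketch in Python that follows
the formulas as given in the text; the function names are
illustrative assumptions and do not belong to any particular
software package.

```python
import math

def alpha_rotatable(n_c):
    """Axial distance for rotatability: fourth root of the number of cube points."""
    return n_c ** 0.25

def alpha_orthogonal(n_c, n_s, n_0):
    """Axial distance for orthogonality, from cube, star, and center point counts."""
    total = n_c + n_s + n_0
    return ((math.sqrt(total) - math.sqrt(n_c)) ** 2 * n_c / 4.0) ** 0.25

def center_points_for_both(n_c, k):
    """Approximate number of center points for a rotatable, near-orthogonal design."""
    return 4.0 * math.sqrt(n_c) + 4.0 - 2.0 * k

def alpha_orthogonal_blocking(k, n_s0, n_s, n_c0, n_c):
    """Axial distance for orthogonal blocking, using the blocking formula in the text."""
    return math.sqrt(k * (1.0 + n_s0 / n_s) / (1.0 + n_c0 / n_c))

# Two-factor design from the text: 4 cube points, 4 star points, 2 center points.
print(round(alpha_rotatable(4), 3))
print(round(alpha_orthogonal(4, 4, 2), 3))
print(center_points_for_both(4, 2))
```

For the two-factor example, the rotatability value is about 1.414,
which matches the star-point distance used in the 10-run design
table. With the number of center points suggested by the third
function, the orthogonality and rotatability distances coincide.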

Available Standard Designs 

The standard central composite designs are usually 
constructed from a 2**(k-p) design for the cube portion of 
the design, which is augmented with center points and star 
points. Box and Draper (1987) list a number of such designs. 

Small composite designs: In the standard designs,
the cube portion of the design is typically of resolution V (or
higher). This is, however, not necessary. When the
experimental runs are expensive, or when it is not
necessary to perform a statistically powerful test of
model adequacy, one could choose designs of resolution III
for the cube portion.

Analyzing Central Composite Designs

The analysis of central composite designs proceeds in 
much the same way as for the analysis of 3**(k-p) designs. 
You fit to the data the general model described above; for 
example, for two variables you would fit the model: 

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2 + b_{11} x_1^2 + b_{22} x_2^2$$
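
As an illustration, the two-variable second-order model can be fit
by ordinary least squares to the runs of the 10-run central
composite design shown earlier. The sketch below uses NumPy; the
coefficient values in true_b and the noise-free response are
hypothetical and serve only to show that the six coefficients are
estimable from this design.

```python
import numpy as np

# Central composite design for two factors: cube, star, and center points.
a = 2 ** 0.5
design = np.array([
    [1, 1], [1, -1], [-1, 1], [-1, -1],   # cube points
    [-a, 0], [a, 0], [0, -a], [0, a],     # star (axial) points
    [0, 0], [0, 0],                       # center points
])

def model_matrix(points):
    """Columns: 1, x1, x2, x1*x2, x1**2, x2**2."""
    x1, x2 = points[:, 0], points[:, 1]
    return np.column_stack([np.ones(len(points)), x1, x2, x1 * x2, x1**2, x2**2])

# Hypothetical coefficients b0, b1, b2, b12, b11, b22 used to simulate a response.
true_b = np.array([10.0, 2.0, -1.0, 0.5, -3.0, -2.0])
X = model_matrix(design)
y = X @ true_b  # noise-free synthetic response, for illustration only

# Ordinary least-squares estimates of the six coefficients.
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat, 6))
```

With real data, y would hold the observed responses for the ten
runs, and the estimates would differ from any underlying values by
sampling error.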

The Fitted Response Surface 

The shape of the fitted overall response can best be 
summarized in graphs and you can generate both contour 
plots and response surface plots (see examples below) for 
the fitted model. 


Categorized Response Surfaces 

You can fit 3D surfaces to your data, categorized by
some other variable. For example, if you replicated a
standard central composite design 4 times, it may be
informative to see how similar the surfaces are when fitted
to each replication.

This would give you a graphical indication of the
reliability of the results and where (e.g., in which region of
the surface) deviations occur.

Clearly, the third replication produced a different
surface. In replications 1, 2, and 4, the fitted surfaces are
very similar to each other. Thus, one should investigate
what could have caused this noticeable difference in the
third replication of the design.


LATIN SQUARE DESIGNS 

Overview 

Latin square designs (the term Latin square was first
used by Euler, 1782) are used when the factors of interest
have more than two levels and you know ahead of time that
there are no (or only negligible) interactions between factors.
For example, if you wanted to examine the effect of 4 fuel
additives on reduction in oxides of nitrogen and had 4 cars
and 4 drivers at your disposal, then you could of course run
a full 4x4x4 factorial design, resulting in 64 experimental
runs. However, you are not really interested in any (minor)
interactions between the fuel additives and drivers, fuel
additives and cars, or cars and drivers. You are mostly
interested in estimating main effects, in particular the one
for the fuel additives factor. At the same time, you want to
make sure that the main effects for drivers and cars do not
affect (bias) your estimate of the main effect for the fuel
additive.


If you labeled the additives with the letters A, B, C, and 
D, the Latin square design that would allow you to derive 
unconfounded main effects estimates could be summarized 
as follows: 



                  Car
Driver     1     2     3     4
   1       A     B     D     C
   2       D     C     A     B
   3       B     D     C     A
   4       C     A     B     D

Latin Square Design



The example shown above is actually only one of the
three possible arrangements in effect estimates. These
“arrangements” are also called Latin squares. The example
above constitutes a 4 x 4 Latin square; and rather than
requiring the 64 runs of the complete factorial, you can
complete the study in only 16 runs.
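
The defining property of the arrangement, namely that each additive
appears exactly once for every driver and exactly once for every
car, can be checked with a few lines of Python. This is an
illustrative sketch; the helper name is_latin_square is an
assumption.

```python
# Rows are drivers 1 to 4, columns are cars 1 to 4, entries are fuel additives.
square = [
    ["A", "B", "D", "C"],
    ["D", "C", "A", "B"],
    ["B", "D", "C", "A"],
    ["C", "A", "B", "D"],
]

def is_latin_square(sq):
    """True if every symbol occurs exactly once in each row and each column."""
    symbols = set(sq[0])
    rows_ok = all(set(row) == symbols and len(row) == len(symbols) for row in sq)
    cols_ok = all(set(col) == symbols for col in zip(*sq))
    return rows_ok and cols_ok and len(sq) == len(symbols)

# One experimental run per cell: (driver, car, additive).
runs = [(d + 1, c + 1, square[d][c]) for d in range(4) for c in range(4)]

print(is_latin_square(square))
print(len(runs))
```

The list runs enumerates the 16 driver, car, and additive
combinations that make up the experiment.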

Greco-Latin square: A nice feature of Latin Squares 
is that they can be superimposed to form what are called 
Greco-Latin squares (this term was first used by Fisher and 
Yates, 1934). In the resultant Greco-Latin square design, 
you can evaluate the main effects of four 3-level factors 
(row factor, column factor, Roman letters, Greek letters) in 
only 9 runs. 
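
A 3-level Greco-Latin square can be sketched as follows. The
construction by cyclic shifts used here is one standard way to
obtain two orthogonal 3-by-3 Latin squares; it is an assumption made
for illustration and is not taken from the tabulated designs
referred to in this text.

```python
n = 3
roman = "ABC"
greek = ["alpha", "beta", "gamma"]

# Two 3-by-3 Latin squares built by cyclic shifts with different step sizes.
latin_1 = [[(i + j) % n for j in range(n)] for i in range(n)]
latin_2 = [[(i + 2 * j) % n for j in range(n)] for i in range(n)]

# Superimpose them: each run is (row level, column level, Roman letter, Greek letter).
runs = [(i + 1, j + 1, roman[latin_1[i][j]], greek[latin_2[i][j]])
        for i in range(n) for j in range(n)]

# The squares are orthogonal if every (Roman, Greek) pair occurs exactly once.
pairs = {(r, g) for _, _, r, g in runs}
print(len(runs), len(pairs))
```

Nine runs with nine distinct letter pairs confirm that the four
3-level factors are mutually balanced in this arrangement.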

Hyper-Greco Latin square: For some numbers of 
levels, there are more than two possible Latin square 
arrangements. For example, there are three possible 
arrangements for 4-level Latin squares. If all three of them 
are superimposed, you get a Hyper-Greco Latin square 
design. In that design you can estimate the main effects of 
all five 4-level factors with only 16 runs in the experiment. 

Analyzing the Design 

Analyzing Latin square designs is straightforward. Also,
plots of means can be produced to aid in the interpretation
of results.

Very Large Designs, Random Effects, Unbalanced 
Nesting 

Note that there are several other statistical methods
that can also analyze these types of designs; see the section
on Methods for Analysis of Variance for details. In particular,
the Variance Components and Mixed Model ANOVA/
ANCOVA chapter discusses very efficient methods for
analyzing designs with unbalanced nesting (when the nested
factors have different numbers of levels within the levels of
the factors in which they are nested), very large nested
designs (e.g., with more than 200 levels overall), or
hierarchically nested designs.

TAGUCHI METHODS: ROBUST DESIGN
EXPERIMENTS

Overview 

Applications: Taguchi methods have become increasingly
popular in recent years. The documented examples of sizable
quality improvements that resulted from implementations
of these methods (see, for example, Phadke, 1989; Noori,
1989) have added to the curiosity among American
manufacturers. In fact, some of the leading manufacturers
in this country have begun to use these methods, usually
with great success. For example, AT&T is using these methods
in the manufacture of very large-scale integrated (VLSI)
circuits; Ford Motor Company has also gained significant
quality improvements due to these methods (American
Supplier Institute, 1984 to 1988). However, as the details of
these methods are becoming more widely known, critical
appraisals are also beginning to appear (for example, Bhote,
1988; Tribus and Szonyi, 1989).

Overview: Taguchi robust design methods are set apart 
from traditional quality control procedures and industrial 
experimentation in various respects. Of particular 
importance are: 

1. The concept of quality loss functions, 

2. The use of signal-to-noise (S/N) ratios, and 

3. The use of orthogonal arrays. 

These basic aspects of robust design methods will be 
discussed in the following sections. Several books have 
recently been published on these methods, for example, 
Peace (1993), Phadke (1989), Ross (1988), and Roy (1990), 
to name a few, and it is recommended that you refer to 
those books for further specialized discussions. Introductory 
overviews of Taguchi’s ideas about quality and quality 
improvement can also be found in Barker (1986), Garvin 
(1987), Kackar (1986), and Noori (1989).

Quality and Loss Functions

What is quality: Taguchi’s analysis begins with the 
question of how to define quality. It is not easy to formulate 
a simple definition of what constitutes quality; however 
when your new car stalls in the middle of a busy 
intersection—putting yourself and other motorists at risk— 
you know that your car is not of high quality. Put another 
way, the definition of the inverse of quality is rather 
straightforward: it is the total loss to you and society due to 
functional variations and harmful side effects associated 
with the respective product. Thus, as an operational 
definition, you can measure quality in terms of this loss, 
and the greater the quality loss the lower the quality. 

Discontinuous (step-shaped) loss function: You can
formulate hypotheses about the general nature and shape
of the loss function. Assume a specific ideal point of highest
quality; for example, a perfect car with no quality problems.
It is customary in statistical process control (SPC; see also
Process Analysis) to define tolerances around the nominal
ideal point of the production process. According to the
traditional view implied by common SPC methods, as long
as you are within the manufacturing tolerances you do not
have a problem. Put another way, within the tolerance limits
the quality loss is zero; once you move outside the tolerances,
the quality loss is declared to be unacceptable. Thus,
according to traditional views, the quality loss function is a
discontinuous step function: as long as you are within the
tolerance limits, quality loss is negligible; when you step
outside those tolerances, quality loss becomes unacceptable.

Quadratic loss function: Is the step function implied
by common SPC methods a good model of quality loss?
Return to the “perfect automobile” example. Is there a
difference between a car that, within one year after purchase,
has nothing wrong with it, and a car where minor rattles
develop, a few fixtures fall off, and the clock in the dashboard
breaks (all in-warranty repairs, mind you...)? If you ever
bought a new car of the latter kind, you know very well
how annoying those admittedly minor quality problems can
be. The point here is that it is not realistic to assume that,
as you move away from the nominal specification in your
production process, the quality loss is zero as long as you
stay within the set tolerance limits. Rather, if you are not
exactly on target, then loss will result, for example in
terms of customer satisfaction. Moreover, this loss is
probably not a linear function of the deviation from nominal
specifications, but rather a quadratic function (inverted U).
A rattle in one place in your new car is annoying, but you
would probably not get too upset about it; add two more
rattles, and you might declare the car “junk.” Gradual
deviations from the nominal specifications do not produce
proportional increments in loss, but rather squared
increments.

Conclusion: Controlling variability. If, in fact, quality
loss is a quadratic function of the deviation from a nominal
value, then the goal of your quality improvement efforts
should be to minimize the squared deviations or variance of
the product around nominal (ideal) specifications, rather
than the number of units within specification limits (as is
done in traditional SPC procedures).
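
The contrast between the two views of quality loss can be made
concrete with a small numerical sketch. The loss functions below
follow the verbal description in this section; the scaling constant
k, the cost value, the target and tolerance, and the two sets of
measurements are all hypothetical and chosen only for illustration.

```python
import statistics

def step_loss(y, target, tolerance, cost=1.0):
    """Traditional view: no loss inside the tolerance limits, full cost outside."""
    return 0.0 if abs(y - target) <= tolerance else cost

def quadratic_loss(y, target, k=1.0):
    """Quadratic view: loss grows with the squared deviation from target."""
    return k * (y - target) ** 2

# Hypothetical measurements from two processes, both entirely within tolerance.
target, tolerance = 10.0, 0.5
tight = [9.95, 10.05, 10.0, 9.98, 10.02]
loose = [9.6, 10.4, 10.3, 9.7, 10.0]

results = {}
for name, data in [("tight", tight), ("loose", loose)]:
    avg_step = statistics.fmean([step_loss(y, target, tolerance) for y in data])
    avg_quad = statistics.fmean([quadratic_loss(y, target) for y in data])
    results[name] = (avg_step, avg_quad)
    print(name, avg_step, round(avg_quad, 4))
```

Under the step function both processes look equally good, because
every unit is within tolerance. Under the quadratic function the
more variable process carries a clearly larger average loss, which
is the motivation for minimizing variance around the nominal value.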

Signal-to-Noise (S/N) Ratios 

Measuring quality loss. Even though you have concluded 
that the quality loss function is probably quadratic in nature, 
you still do not know precisely how to measure quality loss. 
However, you know that whatever measure you decide upon 
should reflect the quadratic nature of the function. 

Signal, noise, and control factors: The product of
ideal quality should always respond in exactly the same
manner to the signals provided by the user. When you turn
the key in the ignition of your car you expect that the
starter motor turns and the engine starts. In the
ideal-quality car, the starting process would always proceed
in exactly the same manner; for example, after three turns of
the starter motor the engine comes to life. If, in response to
the same signal (turning the ignition key), there is random
variability in this process, then you have less than ideal
quality. For example, due to such uncontrollable factors as
extreme cold, humidity, engine wear, etc., the engine may
sometimes start only after turning over 20 times, and finally
may not start at all. This example illustrates the key
principle in measuring quality according to Taguchi: You want
to minimize the variability in the product’s performance in
response to noise factors while maximizing the variability
in response to signal factors.

Noise factors are those that are not under the control of 
the operator of a product. In the car example, those factors 
include temperature changes, different qualities of gasoline, 
engine wear, etc. Signal factors are those factors that are 
set or controlled by the operator of the product to make use 
of its intended functions (turning the ignition key to start 
the car). 

Finally, the goal of your quality improvement effort is 
to find the best settings of factors under your control that 
are involved in the production process, in order to maximize 
the S/N ratio; thus, the factors in the experiment represent 
control factors. 

S/N ratios: The conclusion of the previous paragraph is
that quality can be quantified in terms of the respective
product’s response to noise factors and signal factors. The
ideal product will only respond to the operator’s signals and
will be unaffected by random noise factors (weather,
temperature, humidity, etc.). Therefore, the goal of your
quality improvement effort can be stated as attempting to
maximize the signal-to-noise (S/N) ratio for the respective
product. The S/N ratios described in the following
paragraphs have been proposed by Taguchi (1987).

Smaller-the-better: In cases where you want to
minimize the occurrences of some undesirable product
characteristics, you would compute the following S/N ratio:

$$\mathrm{Eta} = -10 \log_{10}\left[\frac{1}{n} \sum_i y_i^2\right]$$

for i = 1 to no. vars (see outer arrays)

Here, Eta is the resultant S/N ratio; n is the number of
observations on the particular product, and y is the
respective characteristic. For example, the number of flaws
in the paint on an automobile could be measured as the y
variable and analyzed via this S/N ratio. The effect of the
signal factors is zero, since zero flaws is the only intended
or desired state of the paint on the car. Note how this S/N
ratio is an expression of the assumed quadratic nature of
the loss function. The factor -10 ensures that this ratio
measures the inverse of “bad quality”; the more flaws in the
paint, the greater the sum of the squared number of
flaws, and the smaller (i.e., more negative) the S/N ratio.
Thus, maximizing this ratio will increase quality.

Nominal-the-best. Here, you have a fixed signal value 
(nominal value), and the variance around this value can be 
considered the result of noise factors: 

$$\mathrm{Eta} = 10 \log_{10}\left(\mathrm{Mean}^2 / \mathrm{Variance}\right)$$

This signal-to-noise ratio could be used whenever ideal 
quality is equated with a particular nominal value. For 
example, the size of piston rings for an automobile engine 
must be as close to specification as possible to ensure high 
quality. 

Larger-the-better: Examples of this type of engineering
problem are fuel economy (miles per gallon) of an automobile,
strength of concrete, resistance of shielding materials, etc.
The following S/N ratio should be used:

$$\mathrm{Eta} = -10 \log_{10}\left[\frac{1}{n} \sum_i \frac{1}{y_i^2}\right]$$

for i = 1 to no. vars

Signed target. This type of S/N ratio is appropriate when
the quality characteristic of interest has an ideal value of 0
(zero), and both positive and negative values of the quality
characteristic may occur. For example, the dc-offset voltage
of a differential operational amplifier may be positive or
negative (see Phadke, 1989). The following S/N ratio should
be used for these types of problems:

$$\mathrm{Eta} = -10 \log_{10}(s^2)$$

for i = 1 to no. vars

where $s^2$ stands for the variance of the quality
characteristic across the measurements (variables).

Fraction defective: This S/N ratio is useful for
minimizing scrap, minimizing the percent of patients who
develop side effects to a drug, etc. Taguchi also refers to the
resultant Eta values as Omegas; note that this S/N ratio is
identical to the familiar logit transformation:

$$\mathrm{Eta} = -10 \log_{10}\left[p / (1 - p)\right]$$

where

p is the proportion defective
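
The S/N ratios described in this section can be written as short
Python functions. This is a sketch under stated assumptions: the
function names are illustrative, and the variance is computed as the
sample variance (divisor n - 1), a choice the text does not specify.

```python
import math
import statistics

def sn_smaller_the_better(y):
    """Eta = -10 * log10( mean of y_i**2 )."""
    return -10.0 * math.log10(sum(v ** 2 for v in y) / len(y))

def sn_nominal_the_best(y):
    """Eta = 10 * log10( mean**2 / variance ); sample variance is an assumption."""
    return 10.0 * math.log10(statistics.fmean(y) ** 2 / statistics.variance(y))

def sn_larger_the_better(y):
    """Eta = -10 * log10( mean of 1 / y_i**2 )."""
    return -10.0 * math.log10(sum(1.0 / v ** 2 for v in y) / len(y))

def sn_signed_target(y):
    """Eta = -10 * log10( variance ); sample variance is an assumption."""
    return -10.0 * math.log10(statistics.variance(y))

def sn_fraction_defective(p):
    """Eta = -10 * log10( p / (1 - p) ) for a proportion defective p."""
    return -10.0 * math.log10(p / (1.0 - p))

# Hypothetical paint-flaw counts for two runs: fewer flaws give a larger Eta.
few_flaws = [1, 2, 1]
many_flaws = [4, 5, 6]
print(sn_smaller_the_better(few_flaws) > sn_smaller_the_better(many_flaws))
```

In each case a larger Eta corresponds to better quality, so the
same maximization logic applies regardless of which ratio fits the
quality characteristic.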

Ordered categories (the accumulation analysis): 
In some cases, measurements on a quality characteristic 
can only be obtained in terms of categorical judgments. For 
example, consumers may rate a product as excellent, good, 
average, or below average. In that case, you would attempt 
to maximize the number of excellent or good ratings. 
Typically, the results of an accumulation analysis are 
summarized graphically in a stacked bar plot. 

Orthogonal Arrays 

The third aspect of Taguchi robust design methods is
the one most similar to traditional techniques. Taguchi has
developed a system of tabulated designs (arrays) that allow
for the maximum number of main effects to be estimated in
an unbiased (orthogonal) manner, with a minimum number
of runs in the experiment. Latin square designs, 2**(k-p)
designs, and Box-Behnken designs are also aimed at
accomplishing this goal. In fact, many of the standard
orthogonal arrays tabulated by Taguchi are identical to
fractional two-level factorials, Plackett-Burman designs,
Box-Behnken designs, Latin squares, Greco-Latin squares, etc.

Analyzing Designs 

Most analyses of robust design experiments amount to
a standard ANOVA of the respective S/N ratios, ignoring
two-way or higher-order interactions. However, when
estimating error variances, one customarily pools together
main effects of negligible size.


Analyzing S/N ratios in standard designs: It should
be noted at this point that, of course, all of the designs
discussed up to this point (e.g., 2**(k-p), 3**(k-p), mixed 2
and 3 level factorials, Latin squares, central composite
designs) could be used to analyze S/N ratios that you
computed. In fact, the many additional diagnostic plots and
other options available for those designs (e.g., estimation of
quadratic components, etc.) may prove very useful when
analyzing the variability (S/N ratios) in the production
process.


Plot of means: A visual summary of the experiment is 
the plot of the average Eta (S/N ratio) by factor levels. In 
this plot, the optimum setting (i.e., largest S/N ratio) for 
each factor can easily be identified. 

Verification experiments: For prediction purposes,
you can compute the expected S/N ratio given a user-defined
combination of settings of factors (ignoring factors that were
pooled into the error term). These predicted S/N ratios can
then be used in a verification experiment, where the engineer
actually sets the machine accordingly and compares the
resultant observed S/N ratio with the predicted S/N ratio
from the experiment. If major deviations occur, one must
conclude that the simple main effect model is not
appropriate.

In those cases, Taguchi (1987) recommends transforming
the dependent variable to accomplish additivity of factors,
that is, to “make” the main effects model fit. Phadke also
discusses in detail methods for achieving additivity of factors.
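
The plot of means and the prediction step can be sketched
numerically. The example below assumes a small orthogonal array with
three two-level factors in four runs and hypothetical Eta values,
and it assumes the usual additive main-effects prediction (grand
mean plus the deviation of each chosen level mean from the grand
mean); the text itself does not spell out the prediction formula.

```python
# L4 orthogonal array: 4 runs, three two-level factors (levels coded 1 and 2).
array = [(1, 1, 1), (1, 2, 2), (2, 1, 2), (2, 2, 1)]
# Hypothetical S/N ratios (Eta) observed in the four runs.
eta = [20.0, 24.0, 27.0, 29.0]
factors = ["A", "B", "C"]

grand_mean = sum(eta) / len(eta)

# Average Eta at each level of each factor (the values shown in a plot of means).
level_means = {}
for col, name in enumerate(factors):
    for level in (1, 2):
        values = [e for run, e in zip(array, eta) if run[col] == level]
        level_means[(name, level)] = sum(values) / len(values)

# Optimum setting: the level with the largest average Eta for each factor.
best = {name: max((1, 2), key=lambda lv: level_means[(name, lv)]) for name in factors}

# Predicted Eta under an additive main-effects model.
predicted = grand_mean + sum(level_means[(n, best[n])] - grand_mean for n in factors)
print(best, predicted)
```

Here the selected combination was not among the four runs, so a
verification experiment at that setting would show whether the
observed Eta agrees with the additive prediction.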

Accumulation Analysis 

When analyzing ordered categorical data, ANOVA is 
not appropriate. Rather, you produce a cumulative plot of 
the number of observations in a particular category. For 
each level of each factor, you plot the cumulative proportion 
of the number of defectives. Thus, this graph provides 
valuable information concerning the distribution of the 
categorical counts across the different factor settings. 


Summary 

To briefly summarize, when using Taguchi methods you
first need to determine the design or control factors that
can be set by the designer or engineer. Those are the factors
in the experiment for which you will try different levels.
Next, you select an appropriate orthogonal array for the
experiment. Then you need to decide how to measure the
quality characteristic of interest. Remember that most S/N
ratios require that multiple measurements are taken in each
run of the experiment; otherwise, for example, the
variability around the nominal value cannot be assessed.
Finally, you conduct the experiment, identify the factors
that most strongly affect the chosen S/N ratio, and reset
your machine or production process accordingly.

MIXTURE DESIGNS AND TRIANGULAR SURFACES


Overview 

Special issues arise when analyzing mixtures of
components that must sum to a constant. For example, if
you wanted to optimize the taste of a fruit punch consisting
of the juices of 5 fruits, then the sum of the proportions of
all juices in each mixture must be 100%. The task of
optimizing mixtures commonly occurs in food processing,
refining, or the manufacturing of chemicals. A number of
designs have been developed specifically to address the
analysis and modeling of mixtures.


Triangular Coordinates 

Mixture proportions are commonly summarized in
triangular (ternary) graphs. For example, suppose you have
a mixture that consists of 3 components A, B, and C. Any
mixture of the three components can be summarized by a
point in the triangular coordinate system defined by the
three variables.

For example, take the following 6 different mixtures of
the 3 components.

  A     B     C
  1     0     0
  0     1     0
  0     0     1
  0.5   0.5   0
  0.5   0     0.5
  0     0.5   0.5


The sum for each mixture is 1.0, so the values for the
components in each mixture can be interpreted as
proportions. If you graph these data in a regular 3D
scatterplot, it becomes apparent that the points form a
triangle in the 3D space. Only the points inside the triangle,
where the sum of the component values is equal to 1, are
valid mixtures. Therefore, one can simply plot only the
triangle to summarize the component values (proportions)
for each mixture.

At the vertex for a particular factor, there is a pure
blend, that is, one that contains only the respective
component. Thus, the coordinates for the vertex point are 1
(or 100%, or however else the mixtures are scaled) for the
respective component, and 0 (zero) for all other components.
At the side opposite to the respective vertex, the value for
the respective component is 0 (zero); at the midpoint of that
side it is .5 (or 50%, etc.) for each of the other components.


To read off the coordinates of a point in the triangular
graph, you would simply “drop” a line from each respective
vertex to the side of the triangle below.
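For plotting, the triangular coordinates can be mapped to
ordinary 2D coordinates. The following is a minimal sketch;
the placement of the vertices (A at the lower left, B at the
lower right, C at the top) is an assumption made here, and
the function name is illustrative.

```python
import math

def ternary_to_xy(a, b, c):
    """Map a mixture (a, b, c) to a point inside an equilateral triangle.

    Assumed layout: A at (0, 0), B at (1, 0), C at (0.5, sqrt(3)/2).
    The components are rescaled to proportions first.
    """
    total = a + b + c
    b, c = b / total, c / total
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y

print(ternary_to_xy(1, 0, 0))        # pure A: lower-left vertex
print(ternary_to_xy(0, 0, 1))        # pure C: top vertex
print(ternary_to_xy(0.5, 0.5, 0))    # binary blend: midpoint of the A-B side
```

The height of a point above a side is proportional to the
value of the component at the opposite vertex, which is what
"dropping a line" from the vertex reads off.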




Triangular Surfaces and Contours

One can now add to the triangle a further dimension,
perpendicular to the triangular plane. Using that dimension,
one could plot the values for a dependent variable, or a
function (surface) that was fit to the dependent variable.
Note that the response surface can either be shown in 3D,
where the predicted response (Taste rating) is indicated by
the distance of the surface from the triangular plane, or it
can be indicated in a contour plot where the contours of
constant height are plotted on the 2D triangle.



It should be mentioned at this point that you can produce
categorized ternary graphs. These are very useful because
they allow you to fit a response surface to a dependent
variable (e.g., Taste) for different levels of a fourth
component.


The Canonical Form of Mixture Polynomials 

Fitting a response surface to mixture data is, in principle, 
done in the same manner as fitting surfaces to, for example, 
data from central composite designs. However, there is the 
issue that mixture data are constrained, that is, the sum of 
all component values must be constant. 

Consider the simple case of two factors A and B. One
may want to fit the simple linear model:

y = b0 + bA*xA + bB*xB

Here y stands for the dependent variable values, bA and
bB stand for the regression coefficients, and xA and xB
stand for the values of the factors. Suppose that xA and xB
must sum to 1; you can multiply b0 by 1 = (xA + xB):

y = (b0*xA + b0*xB) + bA*xA + bB*xB

or:

y = b'A*xA + b'B*xB

where b'A = b0 + bA and b'B = b0 + bB. Thus, the estimation
of this model comes down to fitting a no-intercept multiple
regression model.
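A small numerical check of this reparameterization, as a
sketch: the data below are made up and generated exactly
from y = 2 + 3*xA + 5*xB with xA + xB = 1, so a no-intercept
fit should return b'A = 2 + 3 = 5 and b'B = 2 + 5 = 7. The
helper solves the two normal equations directly.

```python
def fit_no_intercept_2(xa, xb, y):
    """Least-squares fit of y = ba*xa + bb*xb (no intercept).

    Solves the 2x2 normal equations by Cramer's rule.
    """
    saa = sum(a * a for a in xa)
    sbb = sum(b * b for b in xb)
    sab = sum(a * b for a, b in zip(xa, xb))
    say = sum(a * v for a, v in zip(xa, y))
    sby = sum(b * v for b, v in zip(xb, y))
    det = saa * sbb - sab * sab
    return (say * sbb - sab * sby) / det, (saa * sby - sab * say) / det

# Hypothetical two-component blends; xB is fixed by the mixture constraint.
xa = [0.0, 0.25, 0.5, 0.75, 1.0]
xb = [1.0 - a for a in xa]
y = [2 + 3 * a + 5 * b for a, b in zip(xa, xb)]

b_a, b_b = fit_no_intercept_2(xa, xb, y)
print(round(b_a, 6), round(b_b, 6))
```

The intercept b0 cannot be estimated separately from bA and
bB here; it is absorbed into the two blending coefficients.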

Common Models for Mixture Data 

The quadratic and cubic models can be similarly
simplified (as illustrated for the simple linear model above),
yielding four standard models that are customarily fit to
mixture data. Here are the formulas for the 3-variable
case for those models (see Cornell, 1990, for additional
details).

Linear model:

y = b1*x1 + b2*x2 + b3*x3

Quadratic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 +
b23*x2*x3

Special cubic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 +
b23*x2*x3 + b123*x1*x2*x3

Full cubic model:

y = b1*x1 + b2*x2 + b3*x3 + b12*x1*x2 + b13*x1*x3 +
b23*x2*x3 + d12*x1*x2*(x1 - x2) + d13*x1*x3*(x1 - x3) +
d23*x2*x3*(x2 - x3) + b123*x1*x2*x3

(Note that the dij's are also parameters of the model.)
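To make the four models concrete, the following sketch
builds the regressor values (one row of the design matrix)
for a single blend. The function name and the model labels
are illustrative choices, not part of any particular
program.

```python
from itertools import combinations

def mixture_terms(x, model="quadratic"):
    """Regressor values for one blend x = (x1, ..., xq), no intercept."""
    q = len(x)
    pairs = list(combinations(range(q), 2))
    terms = list(x)                                   # b_i * x_i
    if model == "linear":
        return terms
    terms += [x[i] * x[j] for i, j in pairs]          # b_ij * x_i * x_j
    if model == "quadratic":
        return terms
    triples = [x[i] * x[j] * x[k]                     # b_ijk * x_i * x_j * x_k
               for i, j, k in combinations(range(q), 3)]
    if model == "special cubic":
        return terms + triples
    if model == "full cubic":
        deltas = [x[i] * x[j] * (x[i] - x[j]) for i, j in pairs]  # d_ij terms
        return terms + deltas + triples
    raise ValueError(f"unknown model: {model}")

blend = (0.5, 0.5, 0.0)
for name in ("linear", "quadratic", "special cubic", "full cubic"):
    print(name, len(mixture_terms(blend, name)))
```

For three components this gives 3, 6, 7, and 10 terms,
matching the formulas above.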
Standard Designs for Mixture Experiments

Two different types of standard designs are commonly 
used for experiments with mixtures. Both of them will 
evaluate the triangular response surface at the vertices (i.e., 
the corners of the triangle) and the centroids (sides of the 
triangle). Sometimes, those designs are enhanced with 
additional interior points. 

Simplex-lattice designs. In this arrangement of design
points, m+1 equally spaced proportions are tested for each
factor or component in the model:

xi = 0, 1/m, 2/m, ..., 1     i = 1,2,...,q

and all combinations of factor levels that sum to 1 are
tested. The resulting design is called a {q,m} simplex lattice
design. For example, a {q=3, m=2} simplex lattice design will
include the following mixtures:

  A     B     C
  1     0     0
  0     1     0
  0     0     1
  .5    .5    0
  .5    0     .5
  0     .5    .5



A {q=3, m=3} simplex lattice design will include the
points:

  A     B     C
  1     0     0
  0     1     0
  0     0     1
  1/3   2/3   0
  1/3   0     2/3
  0     1/3   2/3
  2/3   1/3   0
  2/3   0     1/3
  0     2/3   1/3
  1/3   1/3   1/3
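The lattice points can be enumerated mechanically. A
minimal sketch (the function name is illustrative): each
component takes a value k/m, and only combinations that sum
to 1 are kept. Exact fractions avoid rounding problems.

```python
from fractions import Fraction
from itertools import product

def simplex_lattice(q, m):
    """All blends of q components with proportions in {0, 1/m, ..., 1}
    whose proportions sum to 1."""
    points = []
    for counts in product(range(m + 1), repeat=q):
        if sum(counts) == m:
            points.append(tuple(Fraction(c, m) for c in counts))
    return points

for point in simplex_lattice(3, 3):
    print(" ".join(str(p) for p in point))
```

For q = 3 this yields 6 points when m = 2 and 10 points
when m = 3.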
Simplex-centroid designs. An alternative arrangement
of settings introduced by Scheffé (1963) is the so-called
simplex-centroid design. Here the design points correspond
to all permutations of the pure blends (e.g., 1 0 0; 0 1 0; 0 0
1), the permutations of the binary blends (1/2 1/2 0; 1/2 0
1/2; 0 1/2 1/2), the permutations of the blends involving
three components, and so on. For example, for 3 factors the
simplex centroid design consists of the points:

  A     B     C
  1     0     0
  0     1     0
  0     0     1
  1/2   1/2   0
  1/2   0     1/2
  0     1/2   1/2
  1/3   1/3   1/3
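A sketch of how these points can be generated (the function
name is illustrative): for every non-empty subset of the
components, the blend contains the members of the subset in
equal proportions and none of the others.

```python
from fractions import Fraction
from itertools import combinations

def simplex_centroid(q):
    """Pure blends, equal binary blends, equal ternary blends, and so on."""
    points = []
    for k in range(1, q + 1):
        for subset in combinations(range(q), k):
            points.append(tuple(
                Fraction(1, k) if i in subset else Fraction(0)
                for i in range(q)
            ))
    return points

for point in simplex_centroid(3):
    print(" ".join(str(p) for p in point))
```

The design has 2**q - 1 points, that is, 7 points for 3
components.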


Adding interior points. These designs are sometimes
augmented with interior points. For example, for 3 factors
one could add the interior points:

  A     B     C
  2/3   1/6   1/6
  1/6   2/3   1/6
  1/6   1/6   2/3

If you plot these points in a scatterplot with triangular
coordinates, you can see how these designs evenly cover the
experimental region defined by the triangle.

Lower Constraints 

The designs described above all require vertex points,
that is, pure blends consisting of only one ingredient. In
practice, those points may often not be valid, that is, pure
blends cannot be produced because of cost or other
constraints. For example, suppose you wanted to study the
effect of a food additive on the taste of the fruit punch. The
additional ingredient may only be varied within small limits,
for example, it may not exceed a certain percentage of the
total. Clearly, a fruit punch that is a pure blend, consisting
only of the additive, would not be a fruit punch at all, or
worse, may be toxic. These types of constraints are very
common in many applications of mixture experiments.

Let us consider a 3-component example, where
component A is constrained so that xA >= .3. The total of the
3-component mixture must be equal to 1. This constraint
can be visualized in a triangular graph by a line at the
triangular coordinate for xA = .3, that is, a line that is parallel
to the triangle’s edge opposite to the A vertex point.



One can now construct the design as before, except that
one side of the triangle is defined by the constraint. Later,
in the analysis, one can review the parameter estimates for
the so-called pseudo-components, treating the constrained
triangle as if it were a full triangle.

Multiple constraints: Multiple lower constraints can
be treated analogously, that is, you can construct the
sub-triangle within the full triangle, and then place the design
points in that sub-triangle according to the chosen design.
Upper and Lower Constraints

When there are both upper and lower constraints (as is 
often the case in experiments involving mixtures), then the 
standard simplex-lattice and simplex-centroid designs can 
no longer be constructed, because the subregion defined by 
the constraints is no longer a triangle. There is a general 
algorithm for finding the vertex and centroid points for such 
constrained designs. 



Note that you can still analyze such designs by fitting 
the standard models to the data. 

Analyzing Mixture Experiments 

The analysis of mixture experiments amounts to a
multiple regression with the intercept set to zero. As
explained earlier, the mixture constraint (that the sum of all
components must be constant) can be accommodated by
fitting multiple regression models that do not include an
intercept term. If you are not familiar with multiple
regression, you may want to review Multiple Regression at
this point.

The specific models that are usually considered were
described earlier. To summarize, one fits to the dependent
variable response surfaces of increasing complexity, that is,
starting with the linear model, then the quadratic model,
special cubic model, and full cubic model. Shown below is a
table with the number of terms or parameters in each model,
for a selected number of components.
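The number of terms in each of the four models follows from
simple counting of the single, pairwise, and three-way
combinations of components. A sketch for q components (the
function name is illustrative):

```python
from math import comb

def n_terms(q):
    """Number of parameters in each standard mixture model for q components."""
    linear = q                                  # one term per component
    quadratic = linear + comb(q, 2)             # plus one b_ij per pair
    special_cubic = quadratic + comb(q, 3)      # plus one b_ijk per triple
    full_cubic = special_cubic + comb(q, 2)     # plus one d_ij per pair
    return {
        "linear": linear,
        "quadratic": quadratic,
        "special cubic": special_cubic,
        "full cubic": full_cubic,
    }

for q in range(2, 7):
    print(q, n_terms(q))
```

The counts grow quickly with the number of components,
which is one reason the models are compared stepwise rather
than starting with the full cubic model.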



To decide which of the models of increasing complexity
provides a sufficiently good fit to the observed data, one
usually compares the models in a hierarchical, stepwise
fashion. For example, consider a 3-component mixture to
which the full cubic model was fitted.

First, the linear model was fit to the data. Even though
this model has 3 parameters, one for each component, this
model has only 2 degrees of freedom. This is because of the
overall mixture constraint, that the sum of all component
values is constant. The simultaneous test for all parameters
of this model is statistically significant (F(2,11)=5.25; p<.05).
The addition of the 3 quadratic model parameters (b12*x1*x2,
b13*x1*x3, b23*x2*x3) further significantly improves the fit of
the model (F(3,8)=4.99; p<.05). However, adding the
parameters for the special cubic and cubic models does not
significantly improve the fit of the surface. Thus one could
conclude that the quadratic model provides an adequate fit
to the data (of course, pending further examination of the
residuals for outliers, etc.).
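Each step of such a sequential comparison is a test of the
terms added to the previous model. As a sketch, the F
statistic can be computed from the residual sums of squares
of the two nested models; the sums of squares below are
hypothetical, not the values behind the F ratios quoted
above.

```python
def partial_f(sse_reduced, sse_full, df_added, df_resid_full):
    """F statistic for the terms added when moving from the reduced
    model to the fuller model.

    sse_reduced:   residual sum of squares of the simpler model
    sse_full:      residual sum of squares of the more complex model
    df_added:      number of parameters added
    df_resid_full: residual degrees of freedom of the more complex model
    """
    return ((sse_reduced - sse_full) / df_added) / (sse_full / df_resid_full)

# Hypothetical example: adding 3 quadratic terms to a linear model, 14 runs.
f_value = partial_f(sse_reduced=100.0, sse_full=40.0, df_added=3, df_resid_full=8)
print(f_value)
```

The result is compared with the F distribution with
(df_added, df_resid_full) degrees of freedom.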




[Table: ANOVA; Var.: DV (mixt4.sta). 3 Factor mixture design;
Mixture total=1., 14 Runs. Sequential fit of models of
increasing complexity.]

R-square: The R-square value can be interpreted as
the proportion of variability around the mean for the
dependent variable that can be accounted for by the
respective model. (Note that for non-intercept models, some
multiple regression programs will only compute the R-square
value pertaining to the proportion of variance around 0
(zero) accounted for by the independent variables.)


Pure error and lack of fit: The usefulness of the 
estimate of pure error for assessing the overall lack of fit 
was discussed in the context of central composite designs. If 
some runs in the design were replicated, then one can 
compute an estimate of error variability based only on the 
variability between replicated runs. This variability provides
a good indication of the unreliability in the measurements, 
independent of the model that was fit to the data, since it is 
based on identical factor settings (or blends in this case). 
One can test the residual variability after fitting the current 
model against this estimate of pure error. 

If this test is statistically significant, that is, if the 
residual variability is significantly larger than the pure error 
variability, then one can conclude that, most likely, there 
are additional significant differences between blends that 
cannot be accounted for by the current model. Thus, there 
may be an overall lack of fit of the current model. In that 
case, try a more complex model, perhaps by only adding
individual terms of the next higher-order model (e.g., only
the b13*x1*x3 to the linear model).
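The lack-of-fit test can be sketched as follows. The
replicate measurements and the residual sum of squares are
hypothetical; the function names are illustrative.

```python
def pure_error(groups):
    """Pure-error sum of squares and degrees of freedom.

    groups: list of lists, each holding the responses for one replicated blend.
    """
    ss, df = 0.0, 0
    for ys in groups:
        mean = sum(ys) / len(ys)
        ss += sum((v - mean) ** 2 for v in ys)
        df += len(ys) - 1
    return ss, df

def lack_of_fit_f(sse_resid, df_resid, ss_pe, df_pe):
    """F statistic testing residual variability against pure error."""
    ss_lof = sse_resid - ss_pe      # residual variability not due to pure error
    df_lof = df_resid - df_pe
    return (ss_lof / df_lof) / (ss_pe / df_pe)

# Hypothetical replicated blends (three blends, two runs each).
replicates = [[10.0, 12.0], [20.0, 24.0], [15.0, 15.0]]
ss_pe, df_pe = pure_error(replicates)

# Hypothetical residual sum of squares and df from the current model.
f_lof = lack_of_fit_f(sse_resid=30.0, df_resid=7, ss_pe=ss_pe, df_pe=df_pe)
print(ss_pe, df_pe, f_lof)
```

A large F value, relative to the F distribution with
(df_lof, df_pe) degrees of freedom, points to lack of fit.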

Parameter Estimates 

Usually, after fitting a particular model, one would next
review the parameter estimates. Remember that the linear
terms in mixture models are constrained, that is, the sum
of the components must be constant. Hence, independent
statistical significance tests for the linear components cannot
be performed.


Pseudo-Components 

To allow for scale-independent comparisons of the
parameter estimates, during the analysis, the component
settings are customarily recoded to so-called
pseudo-components:

x'i = (xi - Li)/(Total - L)

Here, x'i stands for the i’th pseudo-component, xi stands
for the original component value, Li stands for the lower
constraint (limit) for the i’th component, L stands for the
sum of all lower constraints (limits) for all components in
the design, and Total is the mixture total.
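A minimal sketch of this recoding and its inverse, using the
earlier example of a single lower constraint of .3 on
component A (function names are illustrative):

```python
def to_pseudo(x, lower, total=1.0):
    """Recode original component values to pseudo-components."""
    L = sum(lower)
    return [(xi - li) / (total - L) for xi, li in zip(x, lower)]

def from_pseudo(x_pseudo, lower, total=1.0):
    """Convert pseudo-components back to the original metric."""
    L = sum(lower)
    return [li + p * (total - L) for p, li in zip(x_pseudo, lower)]

lower = [0.3, 0.0, 0.0]            # xA >= .3, no lower limits on B and C

print(to_pseudo([1.0, 0.0, 0.0], lower))     # pure A stays a vertex
print(to_pseudo([0.3, 0.7, 0.0], lower))     # vertex of the sub-triangle
print(to_pseudo([0.65, 0.35, 0.0], lower))   # midpoint of a sub-triangle side
```

In the pseudo-component metric the constrained sub-triangle
has vertices (1 0 0), (0 1 0), and (0 0 1), so the standard
designs can be laid out in it directly.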

The issue of lower constraints was also discussed earlier
in this section. If the design is a standard simplex-lattice or
simplex-centroid design (see above), then this transformation
amounts to a rescaling of factors so as to form a sub-triangle
(sub-simplex) as defined by the lower constraints. However,
you can compute the parameter estimates based on the
original (untransformed) metric of the components in the
experiment. If you want to use the fitted parameter values
for prediction purposes (i.e., to predict dependent variable
values), then the parameters for the untransformed
components are often more convenient to use. Note that the
results dialog for mixture experiments contains options to
make predictions for the dependent variable for user-defined
values of the components, in their original metric.


Graph Options 

Surface and contour plots: The respective fitted model
can be visualized in triangular surface plots or contour plots
which, optionally, can also include the respective fitted
function.

Of course, if you have other categorical variables in your
study (e.g., operator or experimenter, machine, etc.) you can
also categorize the 3D surface plot by those variables.

Note that the fitted function displayed in the surface
and contour plots always pertains to the parameter estimates
for the pseudo-components.

Categorized surface plots. If your design involves
replications (and the replications are coded in your data
file), then you can use 3D Ternary Plots to look at the
respective fit, replication-by-replication.

Trace plots: One aid for interpreting the triangular
response surface is the so-called trace plot. Suppose you
looked at the contour plot of the response surface for three
components. Then, determine a reference blend for two of
the components, for example, hold the values for A and B at
1/3 each. Keeping the relative proportions of A and B
constant (i.e., equal proportions in this case), you can then
plot the estimated response (values for the dependent
variable) for different values of C.

If the reference blend for A and B is 1:1, then the
resulting line or response trace is the axis for factor C; that
is, the line from the C vertex point connecting with the
opposite side of the triangle at a right angle. However, trace
plots for other reference blends can also be produced.
Typically, the trace plot contains the traces for all
components, given the current reference blend.
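The blends along a trace can be computed as in the
following sketch: one component is stepped through a range
of values while the remaining components share what is left
in the proportions of the reference blend. The function name
is illustrative.

```python
def trace_blends(reference, index, values):
    """Blends along the response trace for the component at `index`.

    reference: reference blend, e.g. (1/3, 1/3, 1/3)
    values:    values to step the traced component through (0 to 1)
    """
    rest = sum(r for i, r in enumerate(reference) if i != index)
    blends = []
    for v in values:
        blends.append(tuple(
            v if i == index else (1 - v) * r / rest
            for i, r in enumerate(reference)
        ))
    return blends

# Trace for component C with A and B held at a 1:1 ratio.
for blend in trace_blends((1/3, 1/3, 1/3), index=2, values=[0.0, 0.5, 1.0]):
    print(blend)
```

Evaluating the fitted model at each of these blends and
plotting the predictions against the value of the traced
component gives one line of the trace plot.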


Residual plots: Finally, it is important, after deciding
on a model, to review the prediction residuals, in order to
identify outliers or regions of misfit. In addition, one
should review the standard normal probability plot of
residuals and the scatterplot of observed versus predicted
values. Remember that the multiple regression analysis (i.e.,
the process of fitting the surface) assumes that the residuals
are normally distributed, and one should carefully review
the residuals for any apparent outliers.


DESIGNS FOR CONSTRAINED SURFACES AND MIXTURES

Overview 

As mentioned in the context of mixture designs, it often
happens in real-world studies that the experimental region
of interest is constrained, that is, that not all factor settings
can be combined with all settings for the other factors in
the study. There is an algorithm suggested by Piepel (1988)
and Snee (1985) for finding the vertices and centroids for
such constrained regions.


Designs for Constrained Experimental Regions 

When in an experiment with many factors, there are 
constraints concerning the possible values of those factors 
and their combinations, it is not clear how to proceed. A 
reasonable approach is to include in the experiments runs 
at the extreme vertex points and centroid points of the 
constrained region, which should usually provide good 
coverage of the constrained experimental region. In fact, 
the mixture designs reviewed in the previous section provide 
examples for such designs, since they are typically 
constructed to include the vertex and centroid points of the 
constrained region that consists of a triangle (simplex). 

Linear Constraints 

One general way in which one can summarize most
constraints that occur in real world experimentation is in
terms of a linear equation (see Piepel, 1988):

A1*x1 + A2*x2 + ... + Aq*xq + A0 >= 0

Here, A0, ..., Aq are the parameters for the linear
constraint on the q factors, and x1, ..., xq stand for the factor
values (levels) for the q factors. This general formula can
accommodate even very complex constraints. For example,
suppose that in a two-factor experiment the first factor must
always be set at least twice as high as the second, that is,
x1 >= 2*x2. This simple constraint can be rewritten as
x1 - 2*x2 >= 0. The ratio constraint 2*x1/x2 >= 1 can be
rewritten as 2*x1 - x2 >= 0, and so on.

The problem of multiple upper and lower constraints on 
the component values in mixtures was discussed earlier, in 
the context of mixture experiments. For example, suppose 
in a three-component mixture of fruit juices, the upper and 
lower constraints on the components are (see example 3.2, 
in Cornell 1993): 

40% <= Watermelon (x1) <= 80%
10% <= Pineapple (x2) <= 50%
10% <= Orange (x3) <= 30%

These constraints can be rewritten as linear constraints
of the form:

Watermelon:   x1 - 40 >= 0
             -x1 + 80 >= 0
Pineapple:    x2 - 10 >= 0
             -x2 + 50 >= 0
Orange:       x3 - 10 >= 0
             -x3 + 30 >= 0

Thus, the problem of finding design points for mixture
experiments with components with multiple upper and lower
constraints is only a special case of general linear
constraints.
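In code, each constraint of this form can be stored as its
coefficients, and a candidate blend is valid if every
expression is non-negative. The sketch below uses the fruit
juice constraints in percent; the function name and the
candidate blends are illustrative.

```python
# Each tuple is (A1, A2, A3, A0), meaning A1*x1 + A2*x2 + A3*x3 + A0 >= 0.
constraints = [
    (1, 0, 0, -40), (-1, 0, 0, 80),    # Watermelon: 40 <= x1 <= 80
    (0, 1, 0, -10), (0, -1, 0, 50),    # Pineapple:  10 <= x2 <= 50
    (0, 0, 1, -10), (0, 0, -1, 30),    # Orange:     10 <= x3 <= 30
]

def is_valid_blend(x, constraints, total=100):
    """True if x sums to the mixture total and satisfies every constraint."""
    if sum(x) != total:
        return False
    return all(
        sum(a * xi for a, xi in zip(c[:-1], x)) + c[-1] >= 0
        for c in constraints
    )

print(is_valid_blend((60, 20, 20), constraints))   # inside the region
print(is_valid_blend((30, 40, 30), constraints))   # too little watermelon
print(is_valid_blend((80, 10, 10), constraints))   # on the boundary
```

Integer percentages are used so that the comparisons are
exact; with fractional values a small tolerance would be
needed.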

The Piepel and Snee Algorithm 

For the special case of constrained mixtures, algorithms
such as the XVERT algorithm (see, for example, Cornell,
1990) are often used to find the vertex and centroid points
of the constrained region (inside the triangle of three
components, tetrahedron of four components, etc.). The
general algorithm proposed by Piepel (1988) and Snee (1979)
for finding vertices and centroids can be applied to mixtures
as well as non-mixtures. The general approach of this
algorithm is described in detail by Snee (1979).


Specifically, it will consider one-by-one each constraint
written as a linear equation as described above. Each
constraint represents a line (or plane) through the
experimental region. For each successive constraint, it
will evaluate whether or not the current (new) constraint
crosses into the current valid region of the design. If so,
new vertices will be computed which define the new valid
experimental region, updated for the most recent constraint.
It will then check whether or not any of the previously
processed constraints have become redundant, that is, define
lines or planes in the experimental region that are now
entirely outside the valid region. After all constraints have
been processed, it will then compute the centroids for the
sides of the constrained region (of the order requested by
the user). For the two-dimensional (two-factor) case, one
can easily recreate this process by simply drawing lines
through the experimental region, one for each constraint;
what is left is the valid experimental region.
Choosing Points for the Experiment

Once the vertices and centroids have been computed, 
you may face the problem of having to select a subset of 
points for the experiment. If each experimental run is costly, 
then it may not be feasible to simply run all vertex and 
centroid points. In particular, when there are many factors 
and constraints, then the number of centroids can quickly 
get very large. If you are screening a large number of factors, 
and are not interested in non-linear effects, then choosing 
the vertex points only will usually yield good coverage of 
the experimental region. To increase statistical power (to 
increase the degrees of freedom for the ANOVA error term), 
you may also want to include a few runs with the factors 
set at the overall centroid of the constrained region. 

If you are considering a number of different models that 
you might fit once the data have been collected, then you 
may want to use the D- and A-optimal design options. Those 
options will help you select the design points that will extract 
the maximum amount of information from the constrained 
experimental region, given your models. 

Analyzing Designs for Constrained Surfaces and 
Mixtures 

As mentioned in the section on central composite designs
and mixture designs, once the constrained design points
have been chosen for the final experiment, and the data for
the dependent variables of interest have been collected, the
analysis of these designs can proceed in the standard
manner.

For example, Cornell (1990, page 68) describes an
experiment of three plasticizers, and their effect on resultant
vinyl thickness (for automobile seat covers). The constraints
for the three plasticizer components x1, x2, and x3 are:

.409 <= x1 <= .849
.000 <= x2 <= .252
.151 <= x3 <= .274

(Note that these values are already rescaled, so that the
total for each mixture must be equal to 1.) The vertex and
centroid points generated are:

  x1      x2      x3
  .8490   .0000   .1510
  .7260   .0000   .2740
  .4740   .2520   .2740
  .5970   .2520   .1510
  .6615   .1260   .2125
  .7875   .0000   .2125
  .6000   .1260   .2740
  .5355   .2520   .2125
  .7230   .1260   .1510
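These points can be reproduced with a short brute-force
enumeration. Note that this is not the Piepel and Snee
algorithm described above; it simply sets all but one
component to a lower or upper limit, computes the remaining
component from the mixture total, and keeps the blend if
that component is within its own limits. The edge rule used
here (two vertices share an active limit) is an assumption
that holds for this 3-component case.

```python
from fractions import Fraction as F
from itertools import combinations, product

def extreme_vertices(lower, upper, total=F(1)):
    """Vertices of the region defined by lower/upper limits and the mixture total."""
    q = len(lower)
    found = set()
    for free in range(q):                      # component computed from the others
        fixed = [i for i in range(q) if i != free]
        for bounds in product(*[(lower[i], upper[i]) for i in fixed]):
            x = [None] * q
            for i, b in zip(fixed, bounds):
                x[i] = b
            x[free] = total - sum(bounds)
            if lower[free] <= x[free] <= upper[free]:
                found.add(tuple(x))
    return sorted(found)

def shares_limit(u, v, lower, upper):
    """True if both vertices sit on the same lower or upper limit."""
    return any(u[i] == v[i] and u[i] in (lower[i], upper[i]) for i in range(len(u)))

lower = [F("0.409"), F("0"), F("0.151")]
upper = [F("0.849"), F("0.252"), F("0.274")]

vertices = extreme_vertices(lower, upper)
edge_centroids = [
    tuple((a + b) / 2 for a, b in zip(u, v))
    for u, v in combinations(vertices, 2)
    if shares_limit(u, v, lower, upper)
]
overall_centroid = tuple(sum(v[i] for v in vertices) / len(vertices) for i in range(3))

for point in vertices + edge_centroids + [overall_centroid]:
    print([float(p) for p in point])
```

Exact fractions are used so that duplicate vertices found
from different starting components compare equal.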
CONSTRUCTING D- AND A-OPTIMAL DESIGNS

Overview

In the sections on standard factorial designs (see 2**(k-p)
Fractional Factorial Designs and 3**(k-p), Box-Behnken,
and Mixed 2 and 3 Level Factorial Designs) and Central
Composite Designs, the property of orthogonality of factor
effects was discussed. In short, when the factor level settings
for two factors in an experiment are uncorrelated, that is,
when they are varied independently of each other, then
they are said to be orthogonal to each other. (If you are
familiar with matrix and vector algebra, two column vectors
X1 and X2 in the design matrix are orthogonal if X1'*X2 = 0.)
Intuitively, it should be clear that one can extract the
maximum amount of information regarding a dependent
variable from the experimental region (the region defined
by the settings of the factor levels), if all factor effects are
orthogonal to each other. Conversely, suppose one ran a
four-run experiment for two factors as follows:

          x1    x2
  Run 1    1     1
  Run 2    1     1
  Run 3   -1    -1
  Run 4   -1    -1

Now the columns of factor settings for X1 and X2 are
identical to each other (their correlation is 1), and there is
no way in the results to distinguish between the main effect
for X1 and X2.

The D- and A-optimal design procedures provide various
options to select from a list of valid (candidate) points (i.e.,
combinations of factor settings) those points that will extract
the maximum amount of information from the experimental
region, given the respective model that you expect to fit to
the data. You need to supply the list of candidate points (for
example, the vertex and centroid points computed by the
Designs for constrained surface and mixtures option), the
type of model you expect to fit to the data, and the
number of runs for the experiment. It will then construct a
design with the desired number of cases that will provide
as much orthogonality between the columns of the design
matrix as possible.


Basic Ideas 

A technical discussion of the reasoning (and limitations)
of D- and A-optimal designs is beyond the scope of this
introduction. However, the general ideas are fairly
straightforward. Consider again the simple two-factor
experiment in four runs.

          x1    x2
  Run 1    1     1
  Run 2    1     1
  Run 3   -1    -1
  Run 4   -1    -1

As mentioned above, this design, of course, does not
allow one to test, independently, the statistical significance
of the two variables’ contribution to the prediction of the
dependent variable. If you computed the correlation matrix
for the two variables, they would correlate at 1:

        x1     x2
  x1    1.0    1.0
  x2    1.0    1.0

Normally, one would run this experiment so that the
two factors are varied independently of each other.

          x1    x2
  Run 1    1     1
  Run 2    1    -1
  Run 3   -1     1
  Run 4   -1    -1

Now the two variables are uncorrelated, that is, the
correlation matrix for the two factors is:

        x1     x2
  x1    1.0    0.0
  x2    0.0    1.0

Another term that is customarily used in this context is
that the two factors are orthogonal. Technically, if the sum
of the products of the elements of two columns (vectors) in
the design (design matrix) is equal to 0 (zero), then the two
columns are orthogonal.

The determinant of the design matrix: The
determinant D of a square matrix (like the 2-by-2 correlation
matrices shown above) is a specific numerical value that
reflects the amount of independence or redundancy between
the columns and rows of the matrix. For the 2-by-2 case, it
is simply computed as the product of the diagonal elements
minus the product of the off-diagonal elements of the matrix
(for larger matrices the computations are more complex). For
example, for the two matrices shown above, the determinant
D is:

D1 = |1.0  1.0| = 1*1 - 1*1 = 0
     |1.0  1.0|

D2 = |1.0  0.0| = 1*1 - 0*0 = 1
     |0.0  1.0|

Thus, the determinant for the first matrix, computed
from completely redundant factors, is equal to 0. The
determinant for the second matrix, when the factors are
orthogonal, is equal to 1.

D-optimal designs: This basic relationship extends to
larger design matrices, that is, the more redundant the
vectors (columns) of the design matrix, the closer to 0 (zero)
is the determinant of the correlation matrix for those vectors;
the more independent the columns, the larger is the
determinant of that matrix. Thus, finding a design matrix
that maximizes the determinant D of this matrix means
finding a design where the factor effects are maximally
independent of each other. This criterion for selecting a
design is called the D-optimality criterion.

Matrix notation: Actually, the computations are
commonly not performed on the correlation matrix of vectors,
but on the simple cross-product matrix. In matrix notation,
if X denotes the design matrix, then the quantity of interest
here is the determinant of X'X (X-transposed times X).
Thus, the search for D-optimal designs aims to maximize
|X'X|, where the vertical lines (|..|) indicate the
determinant.
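For the two four-run designs discussed above, the
cross-product matrix and its determinant can be computed
directly. A minimal sketch for the two-column case (helper
names are illustrative; no intercept column is included):

```python
def cross_product(X):
    """Return X'X for a design matrix given as a list of rows."""
    p = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(p)]
            for i in range(p)]

def det2(M):
    """Determinant of a 2-by-2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

redundant = [(1, 1), (1, 1), (-1, -1), (-1, -1)]
orthogonal = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

print(cross_product(redundant), det2(cross_product(redundant)))
print(cross_product(orthogonal), det2(cross_product(orthogonal)))
```

The redundant design gives |X'X| = 0, while the orthogonal
design gives |X'X| = 16, the largest value possible for four
runs with settings between -1 and +1.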


A-optimal designs: Looking back at the computations
for the determinant, another way to look at the issue of
independence is to maximize the diagonal elements of the
X'X matrix, while minimizing the off-diagonal elements. The
so-called trace criterion or A-optimality criterion expresses
this idea. Technically, the A-criterion is defined as:

A = trace((X'X)**(-1))

where trace stands for the sum of the diagonal elements (of
the (X'X)**(-1) matrix). A-optimal designs minimize this
quantity.

The information function: It should be mentioned at
this point that D-optimal designs minimize the expected
prediction error for the dependent variable, that is, those
designs will maximize the precision of prediction, and thus
the information (which is defined as the inverse of the error)
that is extracted from the experimental region of interest.

Measuring Design Efficiency 

A number of standard measures have been proposed to 
summarize the efficiency of a design. 

D-efficiency. This measure is related to the D-optimality
criterion:

D-efficiency = 100 * (|X'X|**(1/p) / N)

Here, p is the number of factor effects in the design
(columns in X), and N is the number of requested runs.
This measure can be interpreted as the relative number of
runs (in percent) that would be required by an orthogonal
design to achieve the same value of the determinant |X'X|.
However, remember that an orthogonal design may not be
possible in many cases, that is, it is only a theoretical
“yardstick.” Therefore, you should use this measure rather as a
relative indicator of efficiency, to compare other designs of
the same size, and constructed from the same design points
candidate list. Also note that this measure is only meaningful
(and will only be reported) if you chose to recode the factor
settings in the design (i.e., the factor settings for the design
points in the candidate list), so that they have a minimum
of -1 and a maximum of +1.

A-efficiency. This measure is related to the A-optimality 
criterion: 

A-efficiency = 100 * p / trace(N * (X'X)^(-1))

Here, p stands for the number of factor effects in the
design, N is the number of requested runs, and trace stands
for the sum of the diagonal elements (of N * (X'X)^(-1)). This
measure can be interpreted as the relative number of runs
(in percent) that would be required by an orthogonal design
to achieve the same value of the trace of (X'X)^(-1). However,
again you should use this measure as a relative indicator of
efficiency, to compare other designs of the same size and
constructed from the same design points candidate list; also,
this measure is only meaningful if you chose to recode the
factor settings in the design to the range from -1 to +1
(as for D-efficiency).

G-efficiency. This measure is computed as:

G-efficiency = 100 * square root(p/N) / sigma_M

Again, p stands for the number of factor effects in the
design and N is the number of requested runs; sigma_M
stands for the maximum standard error for prediction across
the list of candidate points. This measure is related to the
so-called G- optimality criterion; G-optimal designs are 
defined as those that will minimize the maximum value of 
the standard error of the predicted response. 
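The three efficiency measures can be sketched in a few lines of pure Python. This is an illustrative sketch only: the function names are hypothetical, the model is assumed to contain an intercept and two main effects, and the standard error of prediction is expressed in units of the error standard deviation (that is, the error sigma is assumed to be 1), so that sigma_M is the largest value of square root(x'(X'X)^(-1)x) over the candidate points.

```python
import math


def xtx(x):
    """Cross-product matrix X'X for a design matrix given as a list of rows."""
    p = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(p)] for i in range(p)]


def det_and_inverse(m):
    """Gauss-Jordan elimination with partial pivoting; returns (determinant, inverse)."""
    n = len(m)
    a = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    det = 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[piv][c]) < 1e-12:
            return 0.0, None  # singular matrix
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            det = -det
        pivot = a[c][c]
        det *= pivot
        a[c] = [v / pivot for v in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return det, [row[n:] for row in a]


def efficiencies(x, candidate_rows):
    """Return (D, A, G) efficiency in percent for design matrix x.

    candidate_rows holds the model-matrix rows of all candidate points.
    """
    n, p = len(x), len(x[0])
    det, inv = det_and_inverse(xtx(x))
    if inv is None:
        raise ValueError("X'X is singular; the efficiencies are undefined")
    d_eff = 100.0 * det ** (1.0 / p) / n
    a_eff = 100.0 * p / (n * sum(inv[i][i] for i in range(p)))

    def prediction_se(row):
        # Standard error of the predicted response, assuming error sigma = 1.
        return math.sqrt(sum(row[i] * inv[i][j] * row[j]
                             for i in range(p) for j in range(p)))

    sigma_m = max(prediction_se(row) for row in candidate_rows)
    g_eff = 100.0 * math.sqrt(p / n) / sigma_m
    return d_eff, a_eff, g_eff


# Hypothetical example: intercept plus two factors coded -1/+1.
corners = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 1, 1]]
unbalanced = [[1, -1, -1], [1, -1, 1], [1, 1, -1], [1, 0, 0]]

print(efficiencies(corners, corners))
print(efficiencies(unbalanced, corners))
```

The orthogonal four-corner design serves as the yardstick here, so all three of its efficiencies come out at (approximately) 100 percent, while the unbalanced design falls below that on each measure.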


Constructing Optimal Designs 

The optimal design facilities will “search for” optimal
designs, given a list of “candidate points.” Put another way,
given a list of points that specifies which regions of the 
design are valid or feasible, and given a user-specified 
number of runs for the final experiment, it will select points 
to optimize the respective criterion. This “searching for” the 
best design is not an exact method, but rather an algorithmic 
procedure that employs certain search strategies to find the 
best design (according to the respective optimality criterion). 

The search procedures or algorithms that have been 
proposed are described below. They are reviewed here in
order of speed: the Sequential or Dykstra method
is the fastest, but also the one most likely to fail, that is,
to yield a design that is not optimal (e.g., only locally optimal;
this issue will be discussed shortly). 


Sequential or Dykstra method: This algorithm is due 
to Dykstra (1971). Starting with an empty design, it will 
search through the candidate list of points, and choose in 
each step the one that maximizes the chosen criterion. There
are no iterations involved; the algorithm simply picks the
requested number of points sequentially. Thus, this method
is the fastest of the ones discussed. Also, by default, this 
method is used to construct the initial designs for the 
remaining methods. 
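The sequential idea can be sketched as a simple greedy loop. The code below is an illustration under stated assumptions, not Dykstra's published procedure: it uses the D criterion, a main-effects model (intercept plus one column per factor), and a small ridge term added to the diagonal of X'X so that the determinant is nonzero while the design still has fewer points than model parameters. The function names and the ridge device are hypothetical choices of this sketch.

```python
from itertools import product


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def model_row(point):
    """Main-effects model row: intercept followed by the factor settings."""
    return [1.0] + [float(v) for v in point]


def info_matrix(points, p, ridge):
    """X'X for the given design points, plus ridge * I (an assumption of this sketch)."""
    rows = [model_row(pt) for pt in points]
    return [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def sequential_design(candidates, n_runs, ridge=1e-4):
    """Start with an empty design and add, one at a time, the candidate
    point that maximizes the (ridge-stabilized) determinant."""
    p = len(model_row(candidates[0]))
    design = []
    for _ in range(n_runs):
        best = max(candidates,
                   key=lambda pt: det(info_matrix(design + [pt], p, ridge)))
        design.append(best)
    return design


# Candidate list: a 3 x 3 grid for two factors coded -1, 0, +1.
candidates = list(product([-1, 0, 1], repeat=2))
design = sequential_design(candidates, n_runs=4)
print(sorted(design))
```

For this small problem the greedy search ends with the four corner points of the grid, which form the orthogonal two-level factorial design.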

Simple exchange (Wynn-Mitchell) method: This 
algorithm is usually attributed to Mitchell and Miller (1970) 
and Wynn (1972). The method starts with an initial design 
of the requested size (by default constructed via the 
sequential search algorithm described above). In each 
iteration, one point (run) in the design will be dropped from 
the design and another added from the list of candidate 
points. The choice of points to be dropped or added is 
sequential, that is, at each step the point that contributes 
least with respect to the chosen optimality criterion (D or
A) is dropped from the design; then the algorithm chooses a 
point from the candidate list so as to optimize the respective 
criterion. The algorithm stops when no further improvement 
is achieved with additional exchanges. 

DETMAX algorithm (exchange with excursions). This 
algorithm, due to Mitchell (1974b), is probably the best 
known and most widely used optimal design search 
algorithm. Like the simple exchange method, first an initial 
design is constructed (by default, via the sequential search 
algorithm described above). The search begins with a simple 
exchange as described above. However, if the respective 
criterion (D or A) does not improve, the algorithm will 
undertake excursions. Specifically, the algorithm will add 
or subtract more than one point at a time, so that, during 
the search, the number of points in the design may vary
between N_D + N_excursion and N_D - N_excursion, where N_D is
the requested design size, and N_excursion refers to the maximum
allowable excursion, as specified by the user. The iterations
will stop when the chosen criterion (D or A) no longer 
improves within the maximum excursion.

Modified Fedorov (simultaneous switching). This
algorithm represents a modification (Cook and Nachtsheim,
1980) of the basic Fedorov algorithm described below. It
also begins with an initial design of the requested size (by
default constructed via the sequential search algorithm). In
each iteration, the algorithm will exchange each point in
the design with one chosen from the candidate list, so as to
optimize the design according to the chosen criterion (D or
A). Unlike the simple exchange algorithm described above,
the exchange is not sequential, but simultaneous. Thus, in
each iteration each point in the design is compared with
each point in the candidate list, and the exchange is made
for the pair that optimizes the design. The algorithm
terminates when there are no further improvements in the
respective optimality criterion.


Fedorov (simultaneous switching). This is the original 
simultaneous switching method proposed by Fedorov. The 
difference between this procedure and the one described 
above (modified Fedorov) is that in each iteration only a 
single exchange is performed, that is, in each iteration all 
possible pairs of points in the design and those in the 
candidate list are evaluated. The algorithm will then 
exchange the pair that optimizes the design (with regard to 
the chosen criterion). Thus, it is easy to see that this 
algorithm potentially can be somewhat slow, since in each
iteration N_D * N_C comparisons are performed (N_D design
points times N_C candidate points), in order to
exchange a single point.
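The exchange step described for the Fedorov method can be sketched as follows. This is a simplified illustration, not Fedorov's original formulation: it recomputes the determinant from scratch for each of the N_D * N_C pairs instead of using updating formulas, it assumes the D criterion and a main-effects model, and the function names and the starting design are hypothetical.

```python
from itertools import product


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def d_value(points):
    """|X'X| for a main-effects model (intercept plus one column per factor)."""
    rows = [[1.0] + [float(v) for v in pt] for pt in points]
    p = len(rows[0])
    return det([[sum(r[i] * r[j] for r in rows) for j in range(p)]
                for i in range(p)])


def fedorov_exchange(design, candidates, tol=1e-9):
    """In each iteration evaluate every (design point, candidate) pair and
    perform only the single exchange that improves |X'X| the most."""
    design = list(design)
    current = d_value(design)
    while True:
        best_value, best_swap = current + tol, None
        for i in range(len(design)):
            for cand in candidates:
                trial = design[:i] + [cand] + design[i + 1:]
                value = d_value(trial)
                if value > best_value:
                    best_value, best_swap = value, (i, cand)
        if best_swap is None:
            return design, current  # no exchange improves the criterion
        i, cand = best_swap
        design[i] = cand
        current = best_value


candidates = list(product([-1, 0, 1], repeat=2))
start = [(0, 0), (0, 1), (1, 0), (-1, -1)]  # deliberately poor initial design
final, final_d = fedorov_exchange(start, candidates)
print(d_value(start), final, final_d)
```

Because every accepted exchange strictly increases the determinant and the number of possible designs is finite, the loop always terminates, although the design it stops at may be only locally optimal.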

General Recommendations 

If you think about the basic strategies represented by 
the different algorithms described above, it should be clear 
that there are usually no exact solutions to the optimal 
design problem. Specifically, the determinant of the X'X
matrix (and the trace of its inverse) are complex functions of
the list of candidate points. In particular, there are usually
several “local minima” with regard to the chosen optimality
criterion; for example, at any point during the search a
design may appear optimal unless you simultaneously
discard half of the points in the design and choose certain 
other points from the candidate list; but, if you only exchange 
individual points or only a few points (via DETMAX), then 
no improvement occurs. 

Therefore, it is important to try a number of different 
initial designs and algorithms. If, after repeating the
optimization several times with random starts, the same or a
very similar final design results, then you can be
reasonably sure that you are not “caught” in a local minimum
or maximum.


Also, the methods described above vary greatly with
regard to their tendency to get “trapped” in local minima or
maxima. As a general rule, the slower the algorithm (i.e.,
the further down on the list of algorithms described above), 
the more likely is the algorithm to yield a truly optimal 
design. However, note that the modified 


D-optimality and A-optimality: For computational
reasons (see Galil and Kiefer, 1980), updating the trace of a
matrix (for the A-optimality criterion) is much slower than
updating the determinant (for D-optimality). Thus, when
you choose the A-optimality criterion, the computations may
require significantly more time as compared to the
D-optimality criterion. Since in practice there are many other
factors that will affect the quality of an experiment (e.g.,
the measurement reliability for the dependent variable), it is
generally recommended that you use the D-optimality criterion.
However, in difficult design situations, for example, when
there appear to be many local optima and repeated trials
yield very different results, you may want to run several
optimization trials using the A criterion to learn more about
the different types of designs that are possible.

Avoiding Matrix Singularity

It may happen during the search process that the program
cannot compute the inverse of the X'X matrix (for A-optimality),
or that the determinant of the matrix becomes almost 0 (zero).
At that point, the search can usually not continue. To avoid
this situation, perform the optimization based on an
augmented X'X matrix:

X'X_augmented = X'X + alpha * (X0'X0 / N0)

where X0 stands for the design matrix constructed from the
list of all N0 candidate points, and alpha is a user-defined
small constant. Thus, you can turn off this feature
by setting alpha to 0 (zero).
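A small numerical sketch may help to show what the augmentation does. The code assumes a main-effects model with two factors, a 3 x 3 grid as the candidate list, and alpha = 0.01; these values and the function names are hypothetical and chosen only for illustration.

```python
from itertools import product


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def xtx(points):
    """X'X for a main-effects model (intercept plus one column per factor)."""
    rows = [[1.0] + [float(v) for v in pt] for pt in points]
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]


def augmented_xtx(design, candidates, alpha):
    """X'X + alpha * (X0'X0 / N0), with X0 built from all N0 candidate points."""
    m, m0 = xtx(design), xtx(candidates)
    n0 = len(candidates)
    p = len(m)
    return [[m[i][j] + alpha * m0[i][j] / n0 for j in range(p)] for i in range(p)]


candidates = list(product([-1, 0, 1], repeat=2))
design = [(-1, -1), (1, 1)]  # two runs cannot support a three-parameter model

print(det(xtx(design)))                                    # singular X'X
print(det(augmented_xtx(design, candidates, alpha=0.01)))  # small but positive
print(det(augmented_xtx(design, candidates, alpha=0.0)))   # feature turned off
```

With alpha set to 0 the augmented matrix is identical to X'X, so the determinant is again zero; any positive alpha adds a small multiple of the (nonsingular) candidate-list cross-product matrix and keeps the search going.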

“Repairing” Designs 

The optimal design features can be used to “repair” 
designs. For example, suppose you ran an orthogonal design, 
but some data were lost (e.g., due to equipment malfunction), 
and now some effects of interest can no longer be estimated. 
You could of course make up the lost runs, but suppose you 
do not have the resources to redo them all. In that case, you 
can set up the list of candidate points from among all valid 
points for the experimental region, add to that list all the 
points that you have already run, and instruct the program to
always force those points into the final design (and never to
drop them out; you can mark points in the candidate list for such
forced inclusion). The program will then only consider excluding
those points from the design that you did not actually run. In this
manner you can, for example, find the best single run to
add to an existing experiment that would optimize the 
respective criterion. 
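Finding the best single run to add can be sketched as a one-step search in which the runs already performed are always kept. The example is hypothetical: a two-factor, two-level factorial experiment in which one of the four runs was lost, a main-effects model, the D criterion, and a 3 x 3 grid of valid candidate points; the function names are illustrative.

```python
from itertools import product


def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [[float(v) for v in row] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[piv][c] == 0.0:
            return 0.0
        if piv != c:
            a[c], a[piv] = a[piv], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return d


def d_value(points):
    """|X'X| for a main-effects model (intercept plus one column per factor)."""
    rows = [[1.0] + [float(v) for v in pt] for pt in points]
    p = len(rows[0])
    return det([[sum(r[i] * r[j] for r in rows) for j in range(p)]
                for i in range(p)])


def best_run_to_add(completed_runs, candidates):
    """Keep all completed runs (forced inclusion) and return the single
    candidate point whose addition maximizes |X'X|."""
    return max(candidates, key=lambda pt: d_value(completed_runs + [pt]))


completed_runs = [(-1, -1), (-1, 1), (1, -1)]  # the (1, 1) run was lost
candidates = list(product([-1, 0, 1], repeat=2))
best = best_run_to_add(completed_runs, candidates)
print(best, d_value(completed_runs + [best]))
```

In this example the search recovers the missing corner (1, 1), which restores the orthogonal design with |X'X| = 64.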

Constrained Experimental Regions and Optimal 
Design 

A typical application of the optimal design features is to
situations in which the experimental region of interest is
constrained. As described earlier in this section, there are
facilities for finding vertex and centroid points for linearly
constrained regions and mixtures. Those points can then be
submitted as the candidate list for constructing an optimal
design of a particular size for a particular model. Thus
these two facilities combined provide a very powerful tool to
cope with the difficult design situation when the design
region of interest is subject to complex constraints, and one
wants to fit particular models with the least number of
runs.

Special Topics 

The following sections introduce several analysis 
techniques. 

Profiling Predicted Responses and Response 
Desirability 

Basic Idea: A typical problem in product development 
is to find a set of conditions, or levels of the input variables, 
that produces the most desirable product in terms of its 
characteristics, or responses on the output variables. The 
procedures used to solve this problem generally involve two 
steps: (1) predicting responses on the dependent, or Y 
variables, by fitting the observed responses using an 
equation based on the levels of the independent, or X 
variables, and (2) finding the levels of the X variables which 
simultaneously produce the most desirable predicted 
responses on the Y variables. Derringer and Suich (1980) 
give, as an example of these procedures, the problem of 
finding the most desirable tire tread compound. There are a 
number of Y variables, such as PICO Abrasion Index, 200 
percent modulus, elongation at break, and hardness. The 
characteristics of the product in terms of the response 
variables depend on the ingredients, the X variables, such 
as hydrated silica level, silane coupling agent level, and 
sulfur. The problem is to select the levels for the X's, which
will maximize the desirability of the responses on the Y's.
The solution must take into account the fact that the levels
for the X's that maximize one response may not maximize a
different response.

When analyzing 2**(k-p) (two-level factorial) designs,
2-level screening designs, 2**(k-p) maximally unconfounded
and minimum aberration designs, 3**(k-p) and Box-Behnken
designs, mixed 2- and 3-level designs, central composite
designs, and mixture designs, response/desirability profiling
allows you to inspect the response surface produced by fitting
the observed responses using an equation based on levels of
the independent variables.

Prediction Profiles. When you analyze the results of 
any of the designs listed above, a separate prediction 
equation for each dependent variable (containing different 
coefficients but the same terms) is fitted to the observed 
responses on the respective dependent variable. Once these 
equations are constructed, predicted values for the dependent 
variables can be computed at any combination of levels of 
the predictor variables. A prediction profile for a dependent 
variable consists of a series of graphs, one for each 
independent variable, of the predicted values for the 
dependent variable at different levels of one independent 
variable, holding the levels of the other independent 
variables constant at specified values, called current values. 
If appropriate current values for the independent variables 
have been selected, inspecting the prediction profile can 
show which levels of the predictor variables produce the 
most desirable predicted response on the dependent variable. 

One might be interested in inspecting the predicted 
values for the dependent variables only at the actual levels 
at which the independent variables were set during the 
experiment. Alternatively, one also might be interested in 
inspecting the predicted values for the dependent variables 
at levels other than the actual levels of the independent
variables used during the experiment, to see if there might
be intermediate levels of the independent variables that 
could produce even more desirable responses. Also, returning 
to the Derringer and Suich (1980) example, for some 
response variables, the most desirable values may not 
necessarily be the most extreme values, for example, the 
most desirable value of elongation may fall within a narrow 
range of the possible values. 

Response Desirability: Different dependent variables 
might have different kinds of relationships between scores 
on the variable and the desirability of the scores. Less filling 
beer may be more desirable, but better tasting beer can also 
be more desirable—lower “fillingness” scores and higher 
“taste” scores are both more desirable. The relationship 
between predicted responses on a dependent variable and 
the desirability of responses is called the desirability 
function. Derringer and Suich (1980) developed a procedure 
for specifying the relationship between predicted responses 
on a dependent variable and the desirability of the responses, 
a procedure that provides for up to three “inflection” points 
in the function. Returning to the tire tread compound 
example described above, their procedure involved 
transforming scores on each of the four tire tread compound 
outcome variables into desirability scores that could range 
from 0.0 for undesirable to 1.0 for very desirable. 

For example, their desirability function for hardness of 
the tire tread compound was defined by assigning a 
desirability value of 0.0 to hardness scores below 60 or 
above 75, a desirability value of 1.0 to mid-point hardness 
scores of 67.5, a desirability value that increased linearly 
from 0.0 up to 1.0 for hardness scores between 60 and 67.5 
and a desirability value that decreased linearly from 1.0 
down to 0.0 for hardness scores between 67.5 and 75.0. 
More generally, they suggested that procedures for defining 
desirability functions should accommodate curvature in the
falloff of desirability between inflection points in the
functions.
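The hardness example translates directly into a piecewise-linear function. The sketch below encodes only the values quoted above (0.0 below 60 or above 75, 1.0 at 67.5, linear in between); the function and parameter names are hypothetical, and the curvature that Derringer and Suich allow for is not included.

```python
def hardness_desirability(hardness, low=60.0, target=67.5, high=75.0):
    """Two-sided, piecewise-linear desirability for tire tread hardness."""
    if hardness <= low or hardness >= high:
        return 0.0  # outside the acceptable range
    if hardness <= target:
        # Rises linearly from 0.0 at `low` to 1.0 at `target`.
        return (hardness - low) / (target - low)
    # Falls linearly from 1.0 at `target` to 0.0 at `high`.
    return (high - hardness) / (high - target)


for value in (55.0, 60.0, 63.75, 67.5, 71.25, 75.0, 80.0):
    print(value, hardness_desirability(value))
```

Hardness scores of 63.75 and 71.25, halfway between an inflection point and the target, both receive a desirability of 0.5.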

After transforming the predicted values of the dependent 
variables at different combinations of levels of the predictor 
variables into individual desirability scores, the overall 
desirability of the outcomes at different combinations of 
levels of the predictor variables can be computed. Derringer 
and Suich (1980) suggested that overall desirability be 
computed as the geometric mean of the individual 
desirabilities (which makes intuitive sense, because if the 
individual desirability of any outcome is 0.0, or unacceptable, 
the overall desirability will be 0.0, or unacceptable, no matter 
how desirable the other individual outcomes are—the 
geometric mean takes the product of all of the values, and 
raises the product to the power of the reciprocal of the 
number of values). Derringer and Suich’s procedure provides 
a straightforward way for transforming predicted values for 
multiple dependent variables into a single overall desirability 
score. The problem of simultaneous optimization of several
response variables then boils down to selecting the levels of 
the predictor variables that maximize the overall desirability 
of the responses on the dependent variables. 
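The geometric mean described above is easy to compute; the short sketch below uses a hypothetical function name and made-up desirability values to show how a single unacceptable outcome forces the overall desirability to zero.

```python
import math


def overall_desirability(individual):
    """Geometric mean of the individual desirabilities: the product of all
    values raised to the power of the reciprocal of the number of values."""
    return math.prod(individual) ** (1.0 / len(individual))


print(overall_desirability([1.0, 0.25]))           # geometric mean of two values
print(overall_desirability([0.9, 0.0, 1.0, 0.8]))  # one unacceptable outcome
```

An arithmetic mean of the second set of values would be 0.675, which would hide the fact that one of the responses is unacceptable; the geometric mean reports 0.0.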

Summary: When one is developing a product whose 
characteristics are known to depend on the “ingredients” of 
which it is constituted, producing the best product possible 
requires determining the effects of the ingredients on each 
characteristic of the product, and then finding the balance 
of ingredients that optimizes the overall desirability of the 
product. In data analytic terms, the procedure that is
followed to maximize product desirability is to (1) find 
adequate models (i.e., prediction equations) to predict 
characteristics of the product as a function of the levels of 
the independent variables, and (2) determine the optimum 
levels of the independent variables for overall product 
quality. These two steps, if followed faithfully, will likely
lead to greater success in product improvement than the
fabled but statistically dubious technique of hoping for 
accidental breakthroughs and discoveries that radically 
improve product quality. 


Residuals Analysis 

Basic Idea. Extended residuals analysis is a collection of
methods for inspecting different residual and predicted
values, and thus for examining the adequacy of the prediction
model, the need for transformations of the variables in the 
model, and the existence of outliers in the data. 

Residuals are the deviations of the observed values on 
the dependent variable from the predicted values, given the 
current model. The ANOVA models used in analyzing 
responses on the dependent variable make certain 
assumptions about the distributions of residual (but not 
predicted) values on the dependent variable. These
assumptions can be summarized by saying that the ANOVA
model assumes normality, linearity, homogeneity of
variances and covariances, and independence of
residuals. All of these
properties of the residuals for a dependent variable can be 
inspected using Residuals analysis. 

Box-Cox Transformations of Dependent Variables 

Basic Idea. It is assumed in analysis of variance that 
the variances in the different groups (experimental 
conditions) are homogeneous, and that they are uncorrelated 
with the means. If the distribution of values within each 
experimental condition is skewed, and the means are 
correlated with the standard deviations, then one can often 
apply an appropriate power transformation to the dependent 
variable to stabilize the variances, and to reduce or eliminate 
the correlation between the means and standard deviations. 
The Box-Cox transformation is useful for selecting an 
appropriate (power) transformation of the dependent 
variable.

Selecting the Box-Cox transformation option will produce
a plot of the Residual Sum of Squares, given the model, as
a function of the value of lambda, where lambda is used to
define a transformation of the dependent variable,

y' = (y**(lambda) - 1) / (g**(lambda-1) * lambda)    if lambda ≠ 0

y' = g * natural log(y)    if lambda = 0

in which g is the geometric mean of the dependent variable
and all values of the dependent variable are greater than zero.
The value of lambda for which the Residual Sum of Squares
is a minimum is the maximum likelihood estimate for this
parameter. It produces the variance-stabilizing
transformation of the dependent variable that reduces or
eliminates the correlation between the group means and
standard deviations.
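The search over lambda can be sketched as follows. The code is an illustration under stated assumptions: the model is a simple one-way layout (each observation is predicted by its group mean), lambda is evaluated on a coarse grid, and the data are made up so that the standard deviation of each group is proportional to its mean. The function names are hypothetical.

```python
import math


def box_cox(y, lam, g):
    """Normalized Box-Cox transformation; g is the geometric mean of the data."""
    if lam == 0:
        return g * math.log(y)
    return (y ** lam - 1.0) / (g ** (lam - 1.0) * lam)


def residual_ss(groups, lam):
    """Residual sum of squares of the transformed values around their group means."""
    values = [y for group in groups for y in group]
    g = math.exp(sum(math.log(y) for y in values) / len(values))
    rss = 0.0
    for group in groups:
        transformed = [box_cox(y, lam, g) for y in group]
        mean = sum(transformed) / len(transformed)
        rss += sum((t - mean) ** 2 for t in transformed)
    return rss


# Made-up data: the spread within each group grows with the group mean.
groups = [[1.0, 2.0, 4.0], [10.0, 20.0, 40.0], [100.0, 200.0, 400.0]]
lambdas = [-1.0, -0.5, 0.0, 0.5, 1.0]

for lam in lambdas:
    print(lam, round(residual_ss(groups, lam), 1))

best_lambda = min(lambdas, key=lambda lam: residual_ss(groups, lam))
print("lambda with the smallest Residual Sum of Squares:", best_lambda)
```

Because the division by g**(lambda-1) puts the transformed values on a comparable scale, the Residual Sums of Squares can be compared across values of lambda; for these multiplicative data the minimum falls at lambda = 0, the natural logarithm.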

In practice, it is not important that you use the exact 
estimated value of lambda for transforming the dependent 
variable. Rather, as a rule of thumb, one should consider 
the following transformations: 


Approximate lambda    Suggested transformation of y

   -1                 Reciprocal
   -0.5               Reciprocal square root
    0                 Natural logarithm
    0.5               Square root
    1                 None